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Abstract 

\^ . This article presents an algorithm for learning the essential graph of a Bayesian network. The 

Si^ , basis of the algorithm is the Maximum Minimum Parents and Children algorithm from [38J , with 

three substantial modifications. The MMPC algorithm is the first stage of the Maximum Minimum 
"^ ' Hill Climbing algorithm for learning the directed acyclic graph of a Bayesian network, introduced 

in [35]. The MMHC algorithm runs in two phases; firstly, the MMPC algorithm to locate the 
skeleton and secondly an edge orientation phase. The computationally expensive part is the edge 



^ I orientation phase. 

ly-s I The first modification introduced to the MMPC algorithm, which requires little additional 

^O ' computational cost, is to obtain the immoralities and hence the essential graph. This renders the 

edge orientation phase, the computationally expensive part, unnecessary, since the entire Markov 
structure that can be derived from data is present in the essential graph. 

(^ I Secondly, the MMPC algorithm can accept independence statements that are logically incon- 

sisted with those rejected, since with tests for independence, a 'do not reject' conclusion for a 
particular independence statement is taken as 'accept' independence. An example is given to illus- 

^1^ ■ trate this and a modification is suggested to ensure that the conditional independence statements 

are logically consistent. 

Thirdly, the MMHC algorithm makes an assumption of faithfulness. An example of a data set 
is given that does not satisfy this assumption and a modification is suggested to deal with some 
situations where the assumption is not satisfied. The example in question also illustrates problems 
with the 'faithfulness' assumption that cannot be tackled by this modification. 

1 Introduction 

To ensure that the paper is self contained, the definitions and main results used, that have been 
developed by other authors, will be stated in the appendix (section |9|). This article considers probability 
distributions over a set of random variables V = {Xi, . . . , X^} that may be represented by a Bayesian 
network (definition [20]) . Each variable is multinomial and the variables have a dependency structure 
that is unknown and is to be established from data. The algorithm in this article is intended for use 
for networks used to model domain knowledge in Decision Support Systems, particularly in medicine. 



An introduction to the topic is found in |19j, a more advanced treatment in Cowell et. al. [lOj and 
applications within medicine in [3j and implementation in [2]. 

Learning a Bayesian network (that is, locating the graph structure) from data is an important 
problem in several fields. It has been studied extensively over the last 15 years. A successful structure 
learning algorithm enables a Decision Support System to be constructed automatically. In bioinfor- 
matics, Bayesian networks learned from data have been used for locating gene regulatory pathways. 
This is discussed by Friedman et. al. in [13]. In some cases, for example with gene pathways, structure 
learning algorithms are used to infer causality. When causality is being learned, the essential graph 
(definition I52|) is more informative than a directed acyclic graph, because all the edges and only those 
edges for which the direction of an edge in the directed acyclic graph may be inferred from data are 
directed in the in the essential graph. Discussion of the use of Bayesian networks for inferring causality 
is discussed in |36) : clearly causality cannot be inferred purely from data. Such inferences can only be 
made if there are other assumptions, for example, that the data comes from a controlled experiment 
where the candidates for cause to effect are already clear. If only the data is used, then only those 
arrows that are directed in the essential graph have a clear causal interpretation and the causal link 
may only be drawn if there are additional modelling assumptions. 

Structure learning is also of importance for classification and variable selection problems as dis- 
cussed by Tsamardinos and Aliferis |39] and structure learning techniques provide the basis of algo- 
rithms for solving these problems |40| HT| . 

Biomedical sciences and genetics, with micro- array gene expression techniques, often produce data 
sets with tens or hundreds of thousands of variables, which require algorithms that enable learning 
of a Bayesian network in real time. Learning Bayesian networks from data, without any additional 
assumptions (for example, that there exists a directed acyclic graph that is faithful (definition |22]) to 
the distribution, or limiting the number of possible nodes in each parent set) is an NP-hard problem 
(see laEI). 

In general, there are two main approaches for learning Bayesian networks from data; search and 
score techniques and constraint based techniques. A search and score techniques looks for the network 
that maximises a score function, where the score function indicates how well the network fits the data. 
The score function may be a likelihood function, or a posterior probability function pgix{E\x.) where 
S is a structure and x is the data matrix. These algorithms search the space of possible structures 
for the structure E that maximised the score function, using a greedy algorithm, local algorithm or 
some other search algorithm. These are discussed in Cooper and Herskovitz [Oj, Heckerman, Geiger 
and Chickering |17] . 

Constraint based algorithms are discussed in Spirtes, Glymour and Scheines |3^. These algorithms 
investigate from the data whether or not there are certain conditional independence structures between 
the variables. This is performed using statistical or information theoretic measures. Networks that do 
not have all the conditional independence statements ascertained from the algorithms are eliminated 
from consideration. 

The constraint - based approach, found in the MMPC algorithm (the skeleton construction stage) 
of |38) has a problem that is illustrated with the example in subsection 17.11 with 1000 observations on 



6 binary variables. The statement 'do not reject independence' is taken as 'accept independence' and 
this acceptance is made without reference to other independence statements that have been rejected. 
The example (subsection 17. ip ) gives a situation where the statement X _L Y\{W,Z} is not rejected 
and hence X _L y|{T^, Z} is accepted, so that the edge {X, Y) is then removed from the graph, even 
though X JLY, X JL Y\Z and Y _L {W, Z}. It is a relatively easy computation to establish that 

Y±{W,Z} and X ±Y\{W,Z} ^ X ±Y\Z 

and therefore the statement X 1. Y\{W,Z} is incompatible with the statement X JL Y\Z that has 
already been accepted. 

The problem is that 'do not reject' only means 'there is insufficient evidence based on this single 
hypothesis test to reject.' It is therefore wrong to accept the independence statement if the rejection 
of other statements logically leads to the rejection of X _L Y\{Z, W}. 

The modification suggested here takes a hierarchical approach; a statement X _L Y\S is accepted 
if and only if the test statistic is sufficiently low and it is also consistent with all the conditional 
independence statements that may be derived from the data involving smaller conditioning sets. 

This modification increases the computational complexity. This increase may not be serious in very 
large examples, with a very large number of variables and a very large number of instantiations, where 
the time taken for a statistical call (definition [6]) is essentially due to computing the cell counts for 
the subsets of the variables involved, and once the cell counts for these variables have been computed, 
the time taken to make the computation for subsets of these variables is insignificant. For smaller 
examples, the increase in complexity may be substantial. 

In the small six variable example in subsection l7.H the modification gives a substantial improvement 
to the quality of the fitted distribution. 

The Maximum Minimum Parents and Children algorithm is introduced in |38j, to find the skeleton 
(definition |24|) of a directed acyclic graph that describes the independence structure. Following the 
discussion above, it is constraint based; it removes an edge {X, Y) from the skeleton if there is a set 
of variables S such that X ^ Y\S. It works in a similar manner to the PC algorithm, developed by 
Spirtes, Glymour and Scheines in |36j . 

The algorithm requires that there exists a directed acyclic graph that is faithful (definition |22|) 
to the probability distribution. That is, all conditional independence statements in the probability 
distribution are represented by d-separation statements (definition \W^ in the graph. The definition of 
faithful is found in |19) page 69 while the definition of d-separation is found in |19) page 49. These 
notions were developed and discussed in work by J. Pearl, D.Geiger and T. Verma, for example |28] 
and [30]. 

In motivating the 'faithfulness' assumption in j38] , they comment in section 2 (under the statement 
of theorem 1 in that article) that there are distributions for which there do not exist faithful Bayesian 
networks, but that these distributions are rare, as discussed by Meek in [24j. But the Women and 
Mathematics example (subsection I7.ip was a randomly chosen example on 6 variables, chosen only 
because of its convenience in size for a toy model and because it provided a useful comparison - other 
authors had already used it to test structure learning algorithms. It turned out that the dependence 



structure derived from data, where conditional independence is rejected when the null hypothesis is 
rejected at the 5% significance level, did not correspond to a distribution that had a faithful graphical 
representation. 

There are two ways that lack of faithfulness can appear. The first is where an algorithm based 
on the principle of adding in a edge in the skeleton between X and Y when and only when there is 
no conditioning set S such that X }-Y\S produces a skeleton that can represent all the associations, 
but any DAG will necessarily have d connection statements that do not correspond to conditional 
dependence. This suggests associations that do not exist, but the procedure developed in this article 
will produce a representation that gives an accurate model for the underlying distribution, in the sense 
of Kullback Leibler distance. 

More seriously, a distribution does not satisfy the faithfulness assumption if there is a situation 
where variable A is independent of C and variable B is independent of C, but variables A and B taken 
together tell us everything about C. This (to quote Edward Nelson [2B]) is the principle on which 
any good detective novel is based and joint distributions that have this property are not rare, at least 
within the social sciences. 

The example presented in section 17.11 is a randomly chosen example from the social sciences con- 
taining a group of variables that have this property. 

All graphs faithful to a given set of conditional independence statements are Markov equivalent 
(definition l30l) to each other. All the graphs in a set of directed acyclic Markov equivalent graphs have 
the same skeleton and the same immoralities (theorem 131 p . This result was established by T.Verma 
and J. Pearl in |44] . These ideas are discussed in [19] sections 4.4 and 4.5; the works of D. Geiger, J. 
Pearl and T.Verma |44) and [15] are important in this regard, as are important contributions by the 
authors D. Madigan, S.A. Andersson, M.D. Perlman and C.T. Volinsky (for example, |22)). Therefore, 
if there is a directed acyclic graph faithful to the distribution, then all faithful DAGs have the same 
skeleton and this unique skeleton is returned by the MMPC algorithm. 

The Maximum Minimum Hill Climbing algorithm was introduced in [38]. It consists of two stages; 
firstly the MMPC algorithm and secondly, having located the skeleton, a stage where the edges of 
the skeleton are oriented to find the best fitting DAG that describes the conditional independence 
statements of a probability distribution. In general, if a distribution has a faithful representation, 
there will be several possible faithful DAGs; any graph within the Markov equivalence class determined 
by the conditional independence statements will fit the data equally well. Only the immoralities 
(definition |23| and strongly protected edges (definition |33|) are the same in every faithful DAG. The 
background material is discussed in [19j sections 4.4 and 4.5; see |22j and [15]. There is therefore 
a level of inefficiency in an algorithm that operates by altering one edge at a time, adding a directed 
edge, or removing a directed edge, or changing the direction of a directed edge, that aims to find a 
suitable DAG; it may spend time comparing different DAGs that have the same essential graph. 

This article discusses a modification of the Maximum Minimum Parent Child algorithm that ensures 
that no conditional independence statements incompatible with known dependencies are accepted, and 
which also locates the immoralities. It then provides a straightforward routine to locate the essential 
graph (definition I 



A substantial evaluation of MMHC against the most widely used Bayesian network learning al- 
gorithms is given in [38] ■ These include the PC algorithm (Spirtes, Glymour and Scheines |36)). the 
Three Phase Dependency Analysis (Cheng, Greiner, Kelly, Bell and Liu |3]), the Sparse Candidate 
(Friedman, Nachman and Pe'er [13]), Optimal Reinsertion (Moore and Wong |25)). Greedy Equivalent 
Search (Chickering p]) and Greedy Search. 

2 Background 

The background material is given extensively in the appendix, section |9l to keep the article self con- 
tained; this section is restricted to a few brief remarks and some notations. The variables, and nodes 
of a graph are denoted by upper case letters, with indices where appropriate; for example, X, Xj. 
Sets of variables are denoted by upper case letters; for example, S. The meaning will be clear from 
the context. The probability function for a collection of random variables Xi, . . . , X^ will be denoted 
PXi,...,Xaj ^ function of d variables. An instantiation of a variable will be denoted by the same letter 
in lower case, e.g. Xi for an instantiation of variable Xi. Hence PXi,...,Xd is a function of d variables 
and PXi,...,Xai^ii • • • i ^d) denotes the probability that the random vector {Xi, . . . , X^) take the value 

{xi,...,Xd). 

Two variables X and Y are conditionally independent given a set of variables S = {Zi, . . . , Zra\ if 
for each {x,y,zi,. . . ,Zm) 

PX,Y,Zu...,Z,Ax, y,Zi,...,Zm) = PX\Zi,...,Z„,{x\zi, ..., Zm)PY\Zi,...,Z,„{y\zi, ■■■, ^m)PZi,...,Z,„(^l, • • • , ^m) 

where Px\Zi,...,Zm denotes the conditional probability function of X given (Zi, . . . , Z^)- The notation 
X J-Y denotes X independent of Y and the notation X _L y IS" denotes X independent of Y conditioned 
on a set of variables S. Similarly, for a single variable Z, X 1. Y\Z denotes X independent of Y given 
Z and for sets of variables A,B,S, A _L B\S denotes the variables in set A are jointly independent of 
the variables in set B given set S. The notation JL denotes the negation of _L; for example, X JLY\S 
denotes that X is not conditionally independent of Y given S. 

The definition of a Bayesian network is given in definition 1201 it involves the factorisation of a 
probability distribution according to a Directed Acyclic Graph (DAG), together with the DAG. A 
probability distribution pxi,...,Xd factorises as 

d 
PXi,...,Xa = Y[P^j\^i 

where for each j £ {1, . . . , d}, Ilj C {Xi, . . . , Xj-i} and the statement no longer holds if any of the 
Hi, . . . ,n^ are replaced by strict subsets. To the factorisation is associated a directed acyclic graph 
Q = {V,D), where V = {Xi, . . . ,Xci} and for each j, the parents of Xj are the sets Ilj. All the 
definitions are given in section |9j 

The connections between variables in a DAG are either fork, collider or chain and their definitions 
are given in definition [131 The definition of d-separation, given in definition [191 the definition of a trail 



is given in definition [Tl] The definition of blocked trail is given in definition [T71 a trail is blocked by a 
set of nodes S if there is a node W in the trail such that either 

• W is not a collider and W (z S or 

• W is a collider and neither W nor any of its descendants are in S. 

This was introduced by Pearl in |29j in 1988. Two nodes are d-separated by a set S if every trail 
between them is blocked by S and two sets of variables are d separated by S. A directed acyclic 
graph is said to be faithful to a probability distribution p if and only if every conditional independence 
statement in the distribution is represented by a d-separation statement in the DAG. This is given by 
definition |2l 

In every Bayesian network, a d-separation statement in the DAG represents a conditional indepen- 
dence statement in the distribution. There is a direct proof of this in theorem 2.2, page 66 of [ 19) : the 
result was established by Verma and Pearl (1988) [l2j . 

The MMPC algorithm locates an appropriate skeleton if and only if there exists a DAG Q such that 
p and Q are faithful to each other. Theorem l28l in the appendix (section [9|, stating that a graph faithful 
to a probability distribution contains an edge between two nodes X and Y if and only ii X JL Y\S 
was introduced by Spirtes, Glymour and Scheines in 1993 |35) and is the crucial characterisation to 
determine the edges that should be absent or present in the skeleton of a faithful graph. 

Algorithms following the constraint based approach estimate from the data whether or not certain 
conditional independence statements hold and, if they do, remove the appropriate edges. The MMPC 
algorithm in [38] will remove an edge {X,Y) from the skeleton if i^o^ ^ -L Y\S is not rejected even 
if this contradicts statements of dependence that have already been established. The modification of 
the MMPC algorithm presented in this article for determining conditional independence therefore only 
accepts a statement X J-Y\S provided that 

1. the result of the individual hypothesis test is 'based on this individual hypothesis test, do not 
reject the hypothesis' and 

2. accepting this statement does not contradict conditional dependence statements with smaller 
conditioning sets that have already been established. 

The modification of the MMPC presented in this article to locate the immoralities and hence the 
essential graph, requires some further background. An immorality, defined in definition [23l is a collider 
connection X ^ Y -i^ Z where X — Y — Z form a vee structure (definition [25]) . The key results, 
following the definition of Markov equivalence (definition |30| are that all faithful graphs are Markov 
equivalent and theorem |3T1 that two DAGs are Markov equivalent if and only if they have the same 
skeleton and the same immoralities. It follows that the immoralities are the same in all faithful graphs. 
If there is a structure where the skeleton contains edges {X,Y) and {Y,Z) but no edge {X,Z), then 
firstly, if the DAG is faithful, there is a set S such that X _L Z\S. Secondly, if {X, Y, Z) is an immorality, 
then X JL Z\R U {Y} for any subset R C V\{X, Y, Z}. In particular, for a distribution that possesses 
a faithful graphical representation, once the skeleton has been determined and once a set S such that 



X J- Z\S has been located, then {X, Y, Z) is an hnmorahty in a faithful DAG if and only if X ± Z\S. 
Furthermore, X ^ Z\S \J {Y}. 

Once the immoralities have been located, this gives rise to other directed edges, that necessarily re- 
tain the same direction in any faithful DAG. These are known as strongly protected edges (definition [ 
and the final phase of the algorithm locates these. 

3 Testing for Conditional Independence 



The divergence measure used by Tsamardinos, Brown and Aliferis in ||38j for determining whether a 
conditional independence statement held is the mutual information measure, based on the Kullback 
Leibler divergence. That is, for two variables X and Y and and a set of variables 5", let Px\s^ 'Py\s ^^^^ 
'Pxy\s denote the empirical probabilities computed from data. Then the mutual information between 
X and Y given S" = s is defined as 

EPXY\s{'X-,V\£} 
Px,Y\s{x, y\s) log rT\-- TTT 
^^y ' Px\s{xk)PY\s{y\s) 

oY^ f M nxx,s{x,y,s)ns{s) 

= 2y nx,Y,s{x,y,s_)\og r 

ti, nx,s{x,s)nY,s{y,s) 

where s denotes that the set S is instantiated as s, nx,Y,six,y,s) denotes the number of times that 
{X,Y,S) is instantiated as {x,y,s) in the data, similarly for nx,six,s), ny^siy^s) and ns{s). Under 
the hypothesis that X J-Y\S = s, 

I{X,Y\S = s) ~ X(|x|-i)(|y|-i) 
where \X\ and \Y\ denote the number of states of X and Y respectively. The following test statistic 
will be used to determine conditional independence. 

Definition 1 (Test Statistic). The following test statistic 

G{X,Y-S)= Y. IiX,Y\S = s) (1) 

s\nx,Y,s{^,yS>5 V(x,y) 
will be used. Since the sum of independent x^ distributions is again x^ the corresponding number of 
degrees of freedom, obtained by summing, is 

d = N>5{X,Y,S){\X\-l){\Y\-l) (2) 

where N>^{X,Y,S) denotes the number of states of S such that nx,Y,six,y,s) > 5 for all {x,y). If 
N{X, Y, S) = 0, then Hi : X ^Y\S is not rejected. 

The critical value is given by 

6iX,Y,S)=xl{l-a) (3) 

where a denotes the nominal significance level of 5%. 
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In |38) . the hypothesis of independence is accepted if G{X,Y;S) < Xdi^ ~ Q^)' where a denotes the 
nominal significance level of 5%. There is no adjustment to accommodate conditional independence 
relations obtained in this way that are incompatible with independence relations that are rejected. 

As indicated earlier, there are two problems with this. Firstly (and less seriously) the true signifi- 
cance level is clearly much higher than the nominal significance level; for large networks, the notion of 
'significance level' is meaningless - with any reasonable nominal significance level, and a large number 
of variables, it is not possible to infer that an an independent replication of the experiment will fit the 
network produced. 

This is not so serious, because the aim is not so much to establish a network that will be reproduced 
by a independent data set (which is not to be expected with large numbers of variables), but rather to 
find a network that fits the dependence structure exhibited in the current data set. For this, the fact 
that the nominal significance level is 5% for each test is not a problem; it is simply a declaration of 
the dependence level that is necessary to be declared an 'association'. 

The second, and much more serious problem with this is that the procedure, is the one indicated 
earlier that leads to the modification of the MMPC algorithm, for determining the conditional indepen- 
dence statements that are to be represented by d-separation statements in the graphical model. The 
Women and Mathematics example, on six variables, gives a situation where conditional independence 
statements are accepted, leading to removal of edges and d-separation statements in the graphical 
model that contradict dependence statements in the distribution. If a single hypothesis test does not 
reject Hq : X 1. Y\S, then the algorithm described by [38J accepts this CI statement, hence the skele- 
ton that does not contain the edge {X,Y), even though this leads to d-separation statements in the 
graphical model where the probability distribution has dependence. 

The following theorem is a consistency results, indicating that if a set of CI statements is consistent, 
then if X _L Y\WUZ and W -LY\Z then the statement X J-Y\Z must also be in the set of conditional 
independence relations. 

Theorem 2. Let X and Y denote two variables and let W, Z denote two disjoint sets of variables, 
neither containing X orY. If X 1. Y\W U Z then 

Y ±W\Z ^ X ±Y\Z. 



Proof liX ± Y\WU Z then, for x, y, w, z such that px,Y,w,z > 0, px,Y,z,w = p^ J''^''' ■ It follows 



Px,w,zPy,w,z 

u± J.1 ^v _L i jn- ^^ uiicii, iwi a,, y, uy, 4^ ouv.li ^•-•-aL yx,Y,W,Z, -^ ^t FX,Y,Z,W — 

that 

Px,w,zPy,w,z 



Y ±W\Z =^ Px,Y,w,z 



Pw,z 
_ Px,w,zPy,zPw,z _ Px,w,zPy,z 

pw,zpz pz 

X ±Y\Z. 



□ 



The Women and Mathematics data set This data set, discussed in more detail in subsection l7.ip . 
presents a situation with four of the variables, B, C, D, E where, using a nominal significance level of 
5%, tests give: C ^ E^ C X- E\D^ but C -L B and C -L D are not rejected. The chi squared test 
applied to Hq : C _L E\{B, D} also fails to reject the null hypothesis. 

But, by theorem^ if E ± C\{B, D}, then {B,D} ±C ^ C ± E, while the result of the hypothesis 
test is: reject C -L E and accept C ^ E. This is a specific example where accepting conditional 
independence when the result of the hypothesis test is 'insufficient information to reject conditional 
independence' leads to a graphical model that does not represent all the associations between the 
variables. Since the hypothesis E _L C\{B^D} is not rejected, the final skeleton from the MMPC 
algorithm does not contain an edge between E and C. Neither does it contain edges {E, B) or {E, D). 
Therefore, in the corresponding graphical model for variables B,C,D,E, E is d-separated from C, 
even though E JL C. D 

The procedure to correct this uses three results; theorem |2] above and theorems [3] and |4] below. 

Theorem 3. Let V = {Xi, . . . ,Xd} denote a set of d variables. Let S = {X^^, . . . ,Xk^} C V. Then, 
forXi,Xj^S, ifX,±Xj\S 

Xi -L Xk^ I {Xfci , . . . , Xk^_-^ ,Xj} 4^ Xi ± {Xk^ ,Xj}. 
Proof of theorem [3] Let R = {Xk^, . . . ,Xk^_-^^} and assume that Xi _L Xj\S. Recall that 

ps{s) > 



PX^,S(^^SPXJ,S{X]S 



Xi ± Xj\S ^ px,,x,,s{xi, Xj,s) = { Psfe) 

I ps{s) = 



If 



PX„SPXj,S 
PXi,XjS = 



PS 
for all s such that psis) > 0, then, for all s such that psis) > 0, it follows that 

PXi,Xj,RPXj,S 
PX„Xj,S = ^ PX,\S = PX,\RU{Xj} 

Pru{Xj} 

so that 

Xi -L Xk^ I {Xk^ ,..., Xk^^^ ,Xj} <:^ Xi ± {Xj , Xk^}. 

as required. D 

Theorem 4. Let (-B(i,i))i<i<j<A; ^e a collection of numbers such that B{i,j) = 1 or for each {i,j). 
For any such collection (i?(i, j))i<j<j<fc, there is a collection of random, variables {Xi, . . . ,Xk} such 
that Xi _L Xj if B{i,j) = and Xi / Xj ifB{i,j) = 1. 



Proof of theorem [4] The proof consists of showmg that there exists a Bayesian network, that may 
contain more than k variables, such that Xi and Xj are d-separated by the empty set if B{i,j) = 
and where Xi and Xj are d-connected (when aU other variables are not instantiated) when B{i,j) = 1. 

Consider a directed acyclic graph constructed in the following way. Start with a graph containing 
nodes {Xi, . . . ,Xi.} and no edges in the graph. Whenever B(i,j) = 1, add a node Tij and directed 
arrows Tij i— )■ Xi and T^- i— )■ Xj. For B(i,j) = do not add any additional nodes and do not add any 
directed arrows to the graph. 

For the graph constructed in this way, Q = (W, D) where W is the node set and D is the directed 
edge set, consider any distribution p for which G = {W,D) is a faithful graphical model. It is possible 
to construct such a probability distribution over the variables in W. Then, since the original variables 
{Xi, . . . ,Xj:} only appear in collider connections, and since an additional variable Tij only appears in 
fork connection between Xi and Xj, it follows that Xi J^ Xj <^ B{i,j) = 1. Now marginalise over the 
variables (Tjj)(j ,v^(j ,w]^ and the distribution thus obtained pxi,...,Xk satisfies the desired property. D 

3.1 Determining Conditional Independence: The Procedure 

The M^PC algorithm (modified MMPC algorithm) modifies the procedure to decide whether a condi- 
tional independence statement is to be included, in the following way. 

With an initialisation of A{X, Y,S) =2 for each {X, Y, S) before the independence relations have 
been established, let A(X,Y; S) = 1 denote that X JL Y\S is to be included in the set of dependence 
relations and let A{X,Y; S) = denote that X _L Y\S is to be included in the set of dependence 
relations. Once a value of or 1 has been established, it does not change for the remainder of the 
procedure. Those which do not have a value of or 1 by the end of the procedure are not used to 
determine which edges the graph contains. 

The procedure computes values of A{X, Y; S) starting with S = (p and only computing A{X, Y; S) 
once A(X, Y; R) has been established for all R C S (strict subsets). 

For two variables X and Y and a subset S, the value A{X, Y; S) =0 is established if and only if 
the following three criteria are satisfied: 

1. G{X,Y\S) < 6{X,Y;S), where G{X,Y\S) and 5iX,Y;S) are defined by equation [U 

2. for each Z ^ S, 



A{Y, Z- S\{Z}) = ^ A{X, Y- S\{Z}) = 



and 



3. For each Z & S, 



A{X, Z; S\{Z}) = ^ A{X, Y; S\{Z}) = 0. 
A{X, Z; {Y} U S\{Z}) =0^X ±{Y,Z} 
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and 

A{Y, Z; {X} U S\{Z}) =0^Y ±{X, Z}. 

If either of the first two criteria is not satisfied, the value A{X, Y;S) = 1 is returned. If the first two 
are satisfied, but the third is not satisfied, then if X / {Y, Z}, but A{X, Z; {Y} U S\{Z}) = 0, then 
set A{X,Y;S) = 1. 

The value A{X, Y;S)=0 is returned if and only if all three criteria are satisfied. 

This could be modified in the following way: if X JL {Y,Z}, but A{X,Z;{Y} U S\{Z}) = 0, then 
set A{X, Z; {Y} U S\{Z}) = 1 if G{X, Z; {Y} U S\{Z}) > G{X, Y; S). If G{X, Z; {Y} U S\{Z}) < 
G{X, Y; S), then set A{X, Y; S) = 1. 

If y / {X, Z}, but A(Y, Z; {X} U S\{Z}) = 0, then set A{Y, Z; {X} U 5\{Z}) = 1 if G{Y, Z; {X} U 
S\{Z}) > G{X, Y; S). If G{Y, Z; {X} U S\{Z}) < G{X, Y; S), set A{X, Y; S) = 1. 

Theorem 5. The statements A{X,Y;S) decided according to the procedure given above, will he con- 
sistent with each other. That is, there exists a collection of random variables such that for any pair of 
variables X, Y and a subset S, X JL Y\S when A{X, Y; S) = 1 and X _L Y\S when A(X, Y; S) = 0. 

Proof The proof is by induction on the number of variables permitted in the conditioning set. By 
theorem HI no inconsistencies are introduced by the statements where the conditioning set is (p. 

Assume that there are no inconsistencies when all the values A{X, Y; S) are considered for all 
{X,Y,S) such that X,Y S, and 15"! < n, where |S| denotes the number of variables in S. A 
statement A{X,Y; S) where IS*! = n will be referred to as a statement of order n. 

Now consider a statement A{X, Y;S), where IS"! = n + 1, derived according to the principles listed 
above. If A{X,Y;S) = 1, then it is not inconsistent with any previous statements. If A{X,Y,S) = 0, 
then 

AiX,Y;S) = ^ px,Y,s = ^"^'^^^'^ ys:ps{s)>0 (4) 

PS 

It always holds that px,Y,s = Px\y,sPy,s = Py\x,sPx,s- If 15*1 = n + 1, then there is a conditional 
independence relation of order n + 1 between the variables in the set {X, y} U S* if and only if either 



Px,Y,s = Px\rPy,s for some R C S U {Y} or Px,Y,s = Px,SPy\r for some Rc SU {X}. 

Without loss of generality, only consider the first of these; px,Y,s = Px\rPy,s where i? is a strict subset 
of S U {y}, since the second is similar. Let S = {Zi, . . . , Zn+i}- Then px,Y,s = Px\rPy,s where 
RQ {y}U5'\{Zj for some i € {1, . . . ,n+ 1} if and only if ^(X, Z^; {y}u5'\{Zi}) =0. Either y G R 
otY ^R. Suppose Y (^ R. Then if X _L yjS", it follows that 

X _L Zi|{y} U S\{Zi} ^ Px\Y,S = PX\Y,S\{Z,} = Px\s = Px\s\{z,} 
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so that 

X 1 Z,\{Y} U S\{Zi} ^X ^{Y, Z,}\S\{Z,}. 

Suppose Y ^ R. Then if X _L y|5 it follows that 

X ± Z,\S\{Z,}^pxys = ^^'^\i^'>^"^^'^ = Px,s\iz.}Ps,Y ^ ^ ^ {Y,Z,}\S\{Z,}. 

PS\{Z,}PS PS\{Z,} 

From the construction method, if the value A{X,Zi;{Y} U S\{Zi}) = is already assigned, then a 
necessary condition for A{X, Y; S) = is that the values A{X, Y; S\{Zi}) = and A{X, Zi\ S'\{Zj}) = 
have already been assigned. 

If the value A(X^ Zi\ 5\{Zj}) = has been assigned, then a necessary condition for A{X^ ^; 5") =0 
is that the value A{X^Y\S\{^Zi\) = has already been assigned. 

The construction is such that all the conditional independence statements accepted are consistent 
with the conditional independence / dependence statements of a lower or equal order. D 

Therefore, if the procedure outlined above is adopted, the set of conditional independence statements 
that are accepted will be logically consistent with the set of conditional independence statements that 
are rejected in the hypothesis tests. 

4 The Modified Maximum Minimum Parents and Children Algo- 
rithm 

This section gives the Maximum Minimum Parents and Children (MMPC) algorithm presented in |38| . 
followed by the modified Maximum Minimum Parents and Children (M'^PC) algorithm that is the main 
contribution of this article. The original MMPC algorithm is presented in a way that makes it clear, 
when compared with the M^PC algorithm, where the modification the locates the immoralities occurs. 
The M^PC algorithm runs in two stages: 

1. run the MMPC algorithm, with the modifications required to obtain both the skeleton and the 
immoralities, 

2. locate the additional edges that are strongly protected. 

Most of the work is in the first part; locating the protected edges after the immoralities have been 
located is relatively straightforward. The criterion for assessing the computational efficiency of an 
algorithm is the number of statistical calls that have to be carried out. A statistical call is defined as 
follows. 

Definition 6 (Statistical Call). A statistical call is a computation, for example, com,puting the value 
of G{Xi,Xj; S) (definitionUl equation^^), that m,easures a level of statistical association. 

This has become the standard measure of efficiency, following the criteria stated in |38) section 7, 
which followed Spirtes, Glymour and Scheines |36] and Margritis and Thrun |23| . The location of 
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strongly protected edges after the immoralities have been obtained does not increase the order of the 
computational time. When the data set is large (say order 5000 instantiations of 6000 variables), and 
the statistical call involves computing the marginal empirical probabilities, then the time taken for a 
statistical call is essentially the time taken to compute the cell counts. For example, if there are 5000 
observations and all the variables are binary, then since the tests require counts of greater than or 
equal to 5, the largest marginal that can be considered has less than 1000 cells, corresponding to less 
than 10 variables, since 2^° = 1024. 

4.1 The Maximum Minimum Parents and Children Algorithm 

The variable set is denoted by F = {Xi, . . . ,Xrf}. The MMPC algorithm introduced in [38j is the 
combination of algorithms [1] [2] and |3l run in the following sequence: stage 1: algorithm [1] stage 2: 
algorithmic! stage 3: algorithm |3l stage 4: algorithmic 

Stage 1 Each variable in turn, for j = 1, . . . , d, is taken as the 'target variable' (to use the notation 
of |38)). The first stage (algorithm [1]) produces for each variable Xj a set Zj which contains all the 
neighbours of Xj in the final graph, along with other variables that will be eliminated in later stages. 

Stage 2 The second stage (algorithm |2|) simply observes that if a variable X^ is not in the neighbour 
set of Xj , then Xj cannot be in the neighbour set of Xf. , so that if X^ ^ Zj , then Xj should be removed 
from Z\^. This stage of the algorithm is not computationally expensive, since it is purely algebraic and 
there are no statistical calls. This run of algorithm |2] is unnecessary, but improves computational 
efficiency; it reduces the numbers and sizes of subsets that have to be considered in the next stage. 

Stage 3 The third stage of the algorithm (algorithm |3]) may be computationally expensive if the 
sizes of the sets produced from stage 1 and stage 2 are large, since it can require searching through 
rather large collections of subsets. 

Starting with Zj as the set of possible neighbours of Xj after stage 2, let yj^h denote the set 
of possible neighbours of Xj after variables Xi,...,Xfc have been considered, where X^ has been 
eliminated if it was present in 3^j,a:-i and if a subset S of 3^j,fc-i was located such that Xj _L X]^^S. 
Consider a particular DAG that is faithful to the distribution. Clearly, X^ cannot be both an ancestor 
and a descendant of Xj in the DAG, otherwise the DAG contains a cycle. If X^ is an ancestor of Xj^ 
then since IIj C 5, taking S = 11^ will give a subset such that Xj _L Xfc|5, since all trails from X^ to 
Xj will have an instantiated chain or fork connection (if the trail passes through a parent of Xj) or an 
uninstantiated collider connection. Clearly, at least one of these necessarily occurs if X^ is an ancestor 
of Xj in a DAG. 

If Xfc is a descendant of Xj in the DAG, then it is possible that X^ may not be eliminated from 
3^j,fc-ii even if there is no directed edge (Xj,Xfc) (that is, from Xj to X^) in the DAG. This happens 
if and only if on every trail from Xj to X^, every chain node on the trail that is a neighbour of Xj, 
required to block the trail when only variables from the neighbour set of Xj are admitted, is also 
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a collider node on a different trail from Xj to X^, and the variable Z such that there is a collider 
connection Xj — > y ^ Z between Xj and Z is not in Zj. 

The configuration in figure [1] represents such a situation; there are two trails from Xj to X^, y is 
a chain node in one and a collider node in the other. It is required that Y is instantiated if the trail 
Xj — Y — Xfc is to be blocked by instantiating nodes from Zj. Suppose that, after stage 2, Z Zj, but 
{y, Xk} G Zj. If the variable Y is instantiated, then there is an open trail X^ — Z — Y — Xj. If y is 
not instantiated, then there is an open trail Xk — Y — Xj. In both cases, Xj and Xk are d-connected. 

In the example in figure [1] although Xk is not eliminated from Zj at this stage, the variable Xj 
will be eliminated from Z^, since Xj ± Xfc||g{y, Z}. 

In general, following the discussion above, by instantiating n^, the parents of X^, Xj will be 
removed from Zk. 

For a particular faithful DAG Q, if Xk is neither a parent nor a child of Xj in Q, then take 5 = IIj. 
Then every trail between Xj and X^ contains either an instantiated fork or chain (if the trail passes 
through a parent of Xj) or an uninstantiated collider (if it does not pass through a parent of Xj) and 
hence Xk is eliminated from Xj at this stage. 

Stage 4 Since, after stage 3, there is the possibility of neighbours that are not present in the skeleton 
of a faithful DAG, it is necessary and sufficient to run algorithm [2] again. After stages 1-4, the resulting 
graph obtained is the skeleton of a faithful graph; it is clear that \{ Zi, . . . , Z^ are the neighbour sets 
for Xi, . . . ,Xd respectively, then there is an edge (X, y) in the skeleton if and only if there does not 
exist a set S such that X _L Y\S. Following theorem 1281 it is therefore the skeleton of a faithful graph. 

Algorithm 1 The MMPC Algorithm, Stage 1 
for j = 1 , . . . , d do 

Set Zj^Q = (f) (the empty set) 
for i = 1, . . . ,(i do 
if i = J then 

oet ^jjj ^ ^j j— 1 
else 

if Xj _L Xj|^j^j_i then 

OcTJ -^j i — ''^^i i — 1 

else 

Set Zj^i = Zj.i^i U {Xj} 
end if 
end if 
end for 
Set Zj = Zj, 
end for 
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Algorithm 2 The MMPC Algorithm, Stages 2 and 4 



for j = 1 , . . . , d do 

Set yjfl = Zj. 

for k = I, . . . ,d do 

if Afc € yj,k-i but Aj ^ ^fc then 

Set yj^k = yj,k-i\{Xk} 
else 

Set yj^k = yj,k-i 
end if 
end for 
Set Zj = yj^d 
end for 



Algorithm 3 The MMPC Algorithm, Stage 3 



for j = 1 , . . . , d do 

Set yjfi = Zj. 

for k = 1, . . . ,d, do 

if Xk G yj,k-i and 35" C yj^k-i\{Xk} such that Xj ± Xk\S then 

Set yj^k = Zj^k-iMXk}- 
else 

Set yj^k = yj,k-i- 
end if 
end for 
Set Zj = yj^d 
end for 




Figure 1: Xj. not eliminated from Zj 
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4.2 The Modified MMPC Algorithm 

The modification of tlie MMPC algoritlim to obtain tlie skeleton and tlie immoralities is presented 
in algorithms El El [71 El |9] and [101 with algorithm [4] used to obtain conditional independence 
/ dependence relations that are consistent with each other. The additional strongly protected edges 
(definition [33| , whose directions are forced after the immoralities are determined, are located by algo- 
rithms [11] [12] and [13] carried out in that sequence. The M^PC algorithm is therefore the sequence: 
algorithms [5] [Gj [7] [8] [9] [TUl [H] [121 [13] run in that order, followed by algorithm [14] to ensure that a 
valid graphical model is returned even if the dependence / independence relations returned do not 
correspond to a probability distribution that has a faithful graphical model. 

Algorithm [4] is used whenever a conditional independence / dependence relation has to be estab- 
lished. 

The way that the modification for locating the immoralities to the MMPC algorithm is made, can 
be seen from the listings of algorithms [5] - [9] and comparing them with the corresponding routines 
for the MMPC algorithm. They are described below, comparing the various stages with the MMPC 
algorithm, showing that stages 1 - 5 of the algorithm return the correct skeleton and the immoralities 
and pinpointing where the original MMPC algorithm has been modified. 

Stage 6, which is listed as algorithm I10| constructs the sets of edges that are, at this stage directed 
and the remaining edges which are undirected. Stages 7,8 and 9 (algorithms [11] [12] and [T3l) then 
locate the remaining strongly protected edges (definition [33|, so that the resulting graph is the essential 
graph. 

At this stage, the algorithm would usually be finished, since (as discussed in []), underlying prob- 
ability distributions usually have a faithful graphical model. The problem is that, when establishing 
dependence from data, a dependence structure in the original distribution will not be accepted into 
the model if there is insufficient evidence from the data. This can lead to a set of dependence / 
independence relations that do not correspond to a probability distribution with a faithful graphical 
representation. This can lead to situations where the algorithm, after stage 9, returns a graph that is 
not the essential graph corresponding to the directed acyclic graph of any Bayesian network. Stage 10 
(algorithm [Tl]) deals with this and returns an essential graph. 

If it is not necessary to employ algorithm [TH then the graphical model returned is similar in quality 
to the network returned by the MMHC algorithm of [38]. If algorithm [TJ] is necessary, the network 
returned may fit the data substantially less well than the network returned by the MMHC algorithm. 

The definition of parents, children, ancestors and descendants is given in definition [16] Then a 
variable X is d-separated (definition [T9]) from the rest of the network by the Markov blanket M{X) 
(definition I2B]) . which is the set of parents and children and parents of children of X, from all the 
other variables in the network (exercise 6 on page 77 of |19)). For a faithful graph, d-separation 
and conditional independence statements are equivalent. Theorem 1311 shows that d separation implies 
conditional independence; by the definition of faithfulness (definition [22]), d separation and conditional 
independence are equivalent for a faithful graph. Theorem [5H] is crucial in the justification of the 
MMPC algorithm; it states that if a graph Q is faithful to a probability distribution p then there is 
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an edge {Xi , Xj ) in the skeleton if and only if there does not exist a subset S of variables such that 
Xi _L Xj [d. 

Locating the immoralities is a minor extension of this principle; if ^ is a faithful graph and there 
is a vee structure (definition |25]) X — Y — Z in the skeleton, and S" is a set such that X J- Z\S with 
Y ^ S, then X — Y — Z is an immorality. This is the content of theorem |221 

In computation, it may be safer to require also that, ii X — Y — Z is a vee-structure in the skeleton, 
then it is an immorality if and only if there exists a set S with Y ^ S such that X J- Z\S and also 
X IZ\SU{Y}. 

With these principles for constructing the skeleton and locating the immoralities, the justification 
proceeds as follows. 

Conditional Independence Tests Algorithm |4] is used to determine the conditional independence 
structure used to construct the skeleton and immoralities of a graphical model. 

For a subset S = {Xf^-^, . . . ,X/c,^} C {Xi, . . . ,X(i}, the notation A(i,j; fci, . . . , fc^) and A{i,j; S) 
will be used interchangeably. 

The initialisation is A{i,j; S) = 2 (or some value different from or 1) for all {i,j; S). 

A value A{i,j; S) = is established if and only if the relation Xi _L -^^^15 is added to the conditional 
independence relations and a value A{i,j; 5) = 1 is established if and only if Xi _L Xj\S is rejected. 

Once a value of A{i,j; S) is determined, then it is stored and it is not altered. 

Algorithm H] determines A{ki, k2;k3, ■ ■ ■ , km+2), according to the three principles listed on page 1101 
In the listing of algorithm |4l the following condition is referred to. 

Definition 7 (Condition H(/ci, k2', k^, . . . , A;j-|_i)). For a subset of variables 

{^fci , • • • , -'^fci+i } ^ {-'^i,--- ,Xd}, 
condition H(/ci, k2', k^, . . . , fci+i) is said to hold if and only if the following three conditions hold: 

G{Xk-^,Xk2\Xk^, . . . ,Xfc._|_2) < ^{Xk^jXk^lXkg, . . . ,Xk.^^) 
where G and 5 are given in definitionlJl equations (OP and ^. 

• For j = 3, . . . ,i + 2, 

A{ki,kj; {/C3, . . . , ki+2}\{kj}) = ^ A{ki,k2; {ks, ..., ki+2}\{kj}) = 

• For j = 3, . . . ,i + 2, 

A{k2, kj; {/C3, . . . , ki+2}\{kj}) = ^ A{ki,k2; {ks, ..., ki+2}\{kj}) = 
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Stage 1 This stage of the M^PC algorithm is algorithm |5] and corresponds to algorithm [T] of the 
MMPC algorithm. It differs from algorithm [T] (stage 1 of the MMPC algorithm) in that whenever it 
is decided to remove an edge {Xi,Xj) because Xi _L Xj\Sij, the set Sij is recorded. This is because if 
Xi _L Xj I Sij , then this may be used to determine whether or not any vee - structures (definition |25]) 
Xi — Y — Xj are immoralities in stage 2, listed as algorithm O 

A conditional independence statement Xj _L X^jS" implies that Xf. is not a neighbour of Xj. The 
neighbours of Xi , . . . , X^ are therefore subsets oi Zi, . . . ,Zii respectively. 

Stage 2 This stage in the M^PC algorithm is algorithm [6l This is a modification of algorithm El 
where the following major component is added: if Xj Zi^., then for each variable Y (^ Zj f] Z^, it is 
checked whether or the triple {Xj,Y,Xk) will be an immorality if {Xj,Y) and {Y,Xk) remain in the 
skeleton at the end. This is straightforward following the additional storage from stage 1; by definition, 
if Xj is excluded from Zi^., it follows that Xj ± Xig\Sjk- It therefore follows from theorem 1291 that if 
both {Xj,Y) and (Y, X^) are in the skeleton, then {Xj, Y, X/.) is an immorality if and only if Y ^ Sjk- 

This is a necessary stage in the M^PC algorithm, unlike stage 2 (listed as algorithm [2|) of the 
MMPC algorithm. 

If Xj G 2fc but Xfc Zj, then the statement 3^^^^ = 3^j,fc-i\{-^fc} removes variable X^. from the 
possible neighbour set of Xj, just as in stage 2 of the MMPC algorithm. 

At the end of stage 2, all structures between three variables that are vee - structures X — Y — Z 
using the current neighbour sets, have been checked; it has been determined whether or not they are 
immoralities, should the corresponding vee - structures still be present in the final graph. 

Stage 3 This stage of the M^PC algorithm is listed as algorithm [T] This is a modification of 
algorithm [3l with the following important addition. When a variable is removed, this may create 
additional vee structures and it has to be determined whether or not these will be immoralities, should 
they still be present in the final graph. 

The procedure for deciding when a variable is to be removed from Zj is the same as that in algo- 
rithm [3l but when it is decided to remove an edge Xi from Xj, the triples {Xi, Y, Xj) are investigated for 
all Y ^ Zif^ Zj (where Zi and Zj are the current candidate neighbour sets for Xi and Xj respectively) 
and tested to see whether or not {Xi , Y, Xj ) is an immorality. The computationally expensive part 
of this is the part that is used in the MMPC algorithm of jS^. The modification does not introduce 
substantial additional costs. 

Once one subset S has been established such that Xi _L Xj\S (the same set as the MMPC algorithm 
uses), it only has to be determined whether or not Y & S. If Y ^ S, then {Xi,Y, Xj) is an immorality 
provided Xi — Y — Xj is a vee structure in the final skeleton, li Y ^ S then {Xi,Y,Xj) is not an 
immorality. 

Stage 4 This stage of the M^PC algorithm is algorithm |8] and is exactly the same as stage 4 of the 
MMPC algorithm, algorithm |2] It is inserted for exactly the same reason described in stage 4 of the 
MMPC algorithm; there are situations (described in the discussion of stage 4 of the MMPC algorithm) 
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where at this stage Xj G Zk, but X^ ^ 2j. The node Xj has to be removed from Z^. Since this does 
not alter Zj D Z^, no new vee structures are created. 

Stage 5 This stage of the M^PC considers the set of triples I that are candidates for immoralities. 
Any triple {X, Y,Z) £ I is an immorality if and only if both {X, Y) and {Y, Z) are in the skeleton. 
This stage is listed as algorithm (9] The first part of this stage removes those where edges have been 
deleted, while the second part of algorithm orders the remaining members of Z so that they are listed 
as {Xi,YXj) for i < j. 

If the independence / dependence relations correspond to those of a probability distribution that has 
a faithful graphical representation, then there will not arise a situation where both (X, Y, Z) and 
(y, Z, W) are in the set of immoralities. This would lead to the direction Z >-^ Y for the direction of 
the directed edge between Y and Z from the first of these and Y >—^ Z for the direction of the directed 
edge between Y and Z for the second of these. 

The problem is that, even if the underlying distribution has a faithful graphical representation, a 
conditional independence relation may be inferred from the data. If the sample size is large enough, 
then this will happen with small probability for any given relation, but since large graphs are in 
view, this situation may happen and has to be considered. The final stage of algorithm |9] deals with 
this, carrying out additional tests if necessary when there are contradicting immoralities. The triple 
{X, Y, Z) is in I provided it is a vee - structure and also there exists a set S, such that Y ^ S and 
X _L Z\S. If it is an immorality, it should also satisfy X _L Z\S U {Y}. 

Summary of stages 1-5 After stages 1-5 have been complete, the skeleton is returned for a 
graphical model that correctly represents the dependence / independence relations established, pro- 
vided there is a faithful graph that fits them. Given the dependence / independence relations, the 
parts of the algorithm devoted to producing the skeleton are the same as the MMPC algorithm. 

In addition, for every vee structure X — Y — Z in the graph, it has been determined whether or 
not {X, Y, Z) is an immorality. No additional statistical calls are required; it is simply required to test 
whether the variable Y is in the set S that the M^PC algorithm uses to exclude the edge between X 
and Z] the edge (X, Z) is excluded because X }- Z\S and X — Y — Z \s a,n immorality \{Y ^ S. 

Stage 6 Stage 6 is listed as algorithm [101 It simply constructs the sets of directed and undirected 
edges from the set X of immoralities and the sets ^i, . . . , 2^, of neighbours of Xi, . . . , X^ respectively. 

Stages 7, 8, 9 The edge set E of an essential graph may be decomposed as E = DUU where D is 
the set of directed edges and U the set of undirected edges. Lemma [Ml states that the structures in 
figure [1] give all the situations where the direction of an edge (W, Y) is forced once the immoralities 
are determined. Algorithms II H [T2] and [T3] consider the structures 3, 1 and 2 in figure [1] in that order. 
Structure 3 indicates that if there is an immorality Xi —Y — X2, so that the directed edge set D contains 
the edges {Xi,Y) G D and (X2, T) G D and undirected edges {Xi,W) G U and {X2, W) G U, no edge 
between Xi and X2, then if there is an edge between W and Y, it must be directed {W,Y). 
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Figure 2: Example where additional immorality will be created 

Since the third structure deals with directions forced directly through the presence of immoralities, 
it seems convenient to start with this one first; directing edges according to the first two structures 
will not produce additional immoralities, provided that there does exist a graph that is faithful to the 
distribution. 

Therefore, if appearances of the third structure are dealt with first, it will be unnecessary to consider 
the third structure again; no new examples will appear when additional edges are directed according 
to occurrences of the first and second structures. 

It seems convenient to deal with the first structure of figure [5] secondly; directing edges according 
to the second structure will not produce additional configurations that have the first structure. If a 
direction W >-^ Y is directed in the second structure, thus playing the role oi Z ^ W in. the first 
structure, there is already an edge (which is Z — t- y in the second structure) in the graph playing that 
role. 

Therefore, if the third structure is considered first and the first structure is considered second, it 
will not be necessary to search for these configurations again after considering the second structure. 

Stage 10 If the conditional dependence / independence statements produced are consistent with a 
probability distribution that has a faithful graphical representation, then the conditional independence 
statements are d-separation statements in a corresponding directed acyclic graph and hence no ad- 
ditional immoralities will be produced when the strongly protected edges are added in; the resulting 
graph will be the essential graph. 

If they are not consistent, then it is possible that, using the procedure of the M^PC algorithm, 
additional immoralities will appear. This happens, for example, if a structure as in figure |2] appears. 
If such a structure appears, then either {X, Y, Z) or (Y, Z, W) will appear as an immorality in the final 
graph. 

There are other problems that may appear if the set of dependence / independence relations does 
not correspond to a probability distribution with a faithful graphical model. The algorithm in stage 
7 (algorithm [TT]) could produce cycles. This occurs if, for example, there is a structure of the type 
illustrated in figure [3l 

If the graph returned after stage 9 satisfies the properties listed in theorem |8l and no additional 
immoralities have been produced in the direction of the edges, then stage 10 is unnecessary and the 
graph produced without stage 10 will be the essential graph of a Bayesian network which corresponds 
to the conditional independence / dependence relations that were derived from the data. 

If the graph produced after stage 9 either contains directed cycles, or immoralities that were not 
in the original set of immoralities, or if the graph obtained by removing the directed edges is not 
triangulated, then step 10 is necessary. This step will produce a graph where the undirected edges 
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Figure 3: Examples where a directed cycle could be created by algorithm [TT] 

may be oriented to produce the DAG of a Bayesian network where all the associations that were 
derived from data are present, but where some of the conditional independence statements may not 
be represented by d-separation statements. 

Theorem 8. 1. The essential graph of a Bayesian network satisfies the three conditions listed below. 

2. If a graph with both directed and undirected edges, that has at most one edge between any pair of 
nodes satisfies the three properties listed below, then the undirected edges may be directed in such 
a way that the resulting directed graph is acyclic and does not contain additional immoralities. 
The conditions are: 

1. The graph formed by removing the undirected edges does not contain a cycle. 

2. The edges that appear in the structures shown in figure are directed according to the directions 
given in figure and 

3. The connected components of the undirected graph obtained by removing the directed edges are 
triangulated. 

Proof of theorem [8] Part 1 All directed acyclic graphs that are Markov equivalent to each other 
have the same skeleton and the same immoralities (theorem 131 p . The essential graph of a Bayesian 
network is the graph where the directed edges are those that have that same direction in every directed 
acyclic graph that is Markov equivalent and all other edges are undirected (definition [32]) . The directed 
edges are the immoralities and the additional strongly protected edges. 

The first condition is satisfied, because otherwise all the Markov equivalent DAGs would contain a 
directed cycle, which is a contradiction. The second condition is satisfied, because these are the strongly 
protected edges, which have the same direction in every Markov equivalent DAG corresponding to the 
essential graph. 

Suppose that the essential graph contains a cycle of undirected edges of length > 4 without a chord. 
Then in any corresponding DAG, the edges of the cycle must all be oriented the same way, otherwise 
there is at least one additional immorality. This implies that the DAG has a directed cycle, which is 
a contradiction. 
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Figure 4: Illustration for proof of theorem |8l A i— >■ C is a directed chord in a cycle of undirected edges 

Suppose that the essential graph contains a cycle of undirected edges of length > 4 with no undi- 
rected chords, but it contains directed chords. Then the graph restricted to nodes of the cycle is 
triangulated, otherwise a structure of the first type in figure [5] appears with W — Y undirected. Such 
structures do not appear. 

Therefore, there exists a structure of the type shown in figure 31 
If the directed edge A i— > C in the structure in figure 2] appears as part of an immorality A — C — D 
with a directed edge D >—?■ C and no edge between B and D, then D — C — B is a structure of the 
first type in figure [5] and hence there is a contradiction, since C >-^ B would be strongly protected. If 
A i-^ C appears as part of an immorality A — C — D with an undirected edge B — D, then the edge 
B >-^ C would be strongly protected, because it is part of a structure of the third type in figure [5j 
Therefore ^ i— > C is not part of an immorality in the essential graph. 

If ^ I— > C is not part of an immorality, but appears in the essential graph as part of a structure of 
the first type, E >—^ A h^ C, then there is an edge between E and B; otherwise the edge A h^ B would 
be strongly protected and hence directed, which is a contradiction. U E — B is undirected, then the 
structure between E, A and B is of the type shown in figured The direction B >-^ E is not permitted, 
since this would force B >-^ A as a strongly protected edge, which is a contradiction. If there is an 
edge E >-^ B in the essential graph, then there is a contradiction; either i? i-^ C is strongly protected, 
or else there is an edge between E and C in the graph, contradicting the assertion that E — A — C is 
a structure of the first type in figure |5l It follows that E — B is undirected and hence that E — A — B 
forms a structure of the type in figure |4] (two undirected edges, one directed edge). If E >-^ A is part of 
an immorality E — A — F, then there is an edge F — B, otherwise A — B would be a strongly protected, 
and hence directed, edge, li F >—^ B is directed from F to B, then B >—^ E is strongly protected and 
hence B >—?■ A is strongly protected (structure 2 of figure |5|) . If there is a direction B >—^ F, then B >—^ A 
is strongly protected; (A,B,F) form a structure 2 of figure |5|). Hence F — B is undirected and E,F 
play the roles of Zi, Z^ while A, B play the roles of y, W^ in a structure 3 from figure |5] and hence 
B ^^ Ais a strongly protected directed edge, which is a contradiction. 

It follows that E ^^ Ais not part of an immorality. 

If A I— > C is not part of an immorality, but appears in the essential graph as part of a structure 
of the second type, with a variable F and directed edges A^^ F and F >-^ C, then there is an edge 
between B and F. Otherwise F >-^ C — B is a structure of the first type and C i— > -B is strongly 
protected, which is a contradiction. B — F is undirected; if it were directed B h^ F, then this would 
force B >—^ C as a strongly protected edge and if it were directed F >—^ B, this would force i? i— >■ ^4 as 
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a strongly protected edge. Therefore the structure between A, B and F is of the type in figure |4] 
By similar arguments as those above, A^^ F does not appear as part of an immorality. 
By induction, considering structures in the essential graph of the type in figure HI if ^ i-^ C is not 
part of an immorality, then there is a similar adjacent structure, with a directed edge E ^^ A that is 
not part of an immorality. Working inductively, since there are no directed cycles, there is an infinite 
sequence of directed edges that are not part of an immorality, which is a contradiction. 

It follows that an essential graph does not have a cycle of undirected edges of length > 4 either 
without a chord or with a directed chord. The connected components of the graph, formed by removing 
the directed edges, are therefore triangulated. 

Part 2 It is sufficient to show that the undirected edges in a graph satisfying the three conditions 
can be directed in a way that produces a directed acyclic graph, with no additional immoralities. 

Firstly, consider undirected edges that form part of a vee-structure. Take one pair, direct them as 
a chain connection X ^^Y ^^ Z. If there is a vee-structure overlapping with one that has already been 
considered, direct the undirected edge in the structure in a way that does not produce an immorality. 
If this procedure produces a cycle, then it follows that there is a cycle of undirected edges of length 
> 4 without a chord, which is a contradiction. 

Once the undirected edges that appear in vee - structures have been considered, any directions for 
the other edges that do not produce a cycle is valid. Such an ordering exists; consider a sequence where 
either orientation for an edge will produce a cycle, then there is already a directed cycle. 

If the orientation of any of the edges produces an immorality with an edge that was directed in the 
original graph, then there is a vee - structure X ^^Y — Z with X ^^Y and Y — Z undirected, which 
is a contradiction. D 



Lemma 9. Stage 10 (algorithm\14\) produces the essential graph of a Bayesian network, corresponding 
to a graphical model that contains all the conditional dependence statements that were derived earlier. 

Proof of lemma [9] The addition of edges in the re-orientation stage ensures that the resulting graph 
will not contain conditional independence relations that are not in the data. Furthermore, these edges 
are necessary if a fork or chain connection is replaced with a collider. Arguments similar to those in 
the proof of theorem |8] ensure that after directed cycles and additional immoralities have been dealt 
with, there is no directed edge that is a chord in an undirected cycle of length > 4. Furthermore, 
when edges are added during the triangulation procedure, arguments similar to those in the proof of 
theorem [8] ensure that none of them appear in the position of VF — y in the ffi'st or second structure 
of figure [5] and the additional edge Zi — Z^ added when an undirected edge is added in the position 
of I^ — y in the structure of the third type in figure |5] during the triangulation ensures that no new 
strongly protected edges emerge during the triangulation. The resulting graph satisfies the criteria 
and, by theorem |8l is the essential graph of a Bayesian network. D 

Stage 7 This stage of the M^PC algorithm is listed as algorithm [11] All the immoralities have 
already been located in the previous stages; algorithm II 1 1 does an exhaustive search for configurations 
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where the direction of an edge W >-^ Y is forced corresponding to the third structure of figure |5l For 
each immorality {Xi,Xj,Xk) G I, enumerating the set Zi D Zj f] Z^ as 

Zi n Zj nZk = {Xa(i) , . . . , Xa(p)}, 

the directed edges {{Xam,Xj), . . . , (Xa(p\,Xj)} are added to D and the undirected edges 

{ {^a{l) ,Xj),... , (X^(p) ,Xj)} 

removed from U . 

Stage 8 This stage of the M^PC algorithm is listed as algorithm [12] It does an exhaustive search 
for structures of the first type in figure [S] The following loop is carried out until no more structures 
of the type are located: for each directed edge {Xi,Xj) € Z?, with the set Zi\Zj enumerated as 
Zi\Zj = {Xq,(]^), . . . ,Xq(^)}, if {Xj,X^n\) € U, then it is removed from U and the directed edge 
{Xj,X(^n\) added to D. 

When considering the first structure, directing an edge W >-^ Y will not create any additional 
immoralities if there is a faithful graph, since all the immoralities have already been located. If a new 
immorality is produced by the directing of an edge at this stage, it means that there is not a faithful 
graph structure to represent the set of conditional independence statements. 

After this stage, there are no structures on three nodes W,Z,Y with directed edge {Z,W), undi- 
rected edge {W, Y) and no edge between Z and Y. 

Stage 9 This stage of the M^PC algorithm is listed as algorithm [131 It deals exhaustively with 
structures of the second type in figure [5] For each directed edge {Xi,Xj) G D, let 

yj = {Xk\{x,,Xk)eD}, 

the set of directed edges from Xj to X^. Then, for each Xj G J^j n Zi, the edge (Xj,Xfc) is added to 
D and the edge {Xi,Xk) is removed from f7, the set of undirected edges. This is repeated until there 
is a cycle where no additional edges are directed. 

After this stage, there are no longer any triples of nodes W, Z, Y with directed edges {W, Z) and 
(Z, y) and an undirected edge {W,Y). If there is a faithful graph, then this stage does not introduce 
new immoralities and hence no further strongly protected edges of the type shown in the third structure. 

Furthermore, directing the edge W ^Y m. the second structure cannot create additional structures 
between three nodes Ari,X2,X3 with a directed edge (Xi,X2) and an undirected edge {X2,X-i) and 
no edge between Xi and X3; the edge (Xi,Ar2) has just been directed and, if it has been directed 
according to the principle in the second structure, there is already a node X4 forming a directed edge 
(X4,X2) (where X4 plays the role of Z in the second structure). Therefore the edge between X2 and 
X3 has already been directed in the previous stage. 
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Algorithm 4 The Modified MMPC Algorithm: determining A{i,j;S) 

initiahsation A{i,j; ki, . . . , k^) = 2 for all m > and all {i,j, ki, . . . , k^) S {1, • • • , d}'""'"^ 
A{i,j; S) = represents Aj _L Xj\S; A{i,j; 5) = 1 represents Aj JL Xj\S 
for i = 0, . . . ,m do 

for each permutation o" of A;i , . . . , kra+2 of length i + 2 do 
if A(A;^(i) , A:^(2) ; A:^(3) , . . . , A:„(i+2) ) = or 1 then 

do nothing 
else 

if Condition E{k^(^i-^, k^(2y., /^^{a)) ■ ■ ■ ■, ka{i+2)) (definition [7| holds then 

Set B{k„^i) , /c^(2) ; A;^(3) , . . . , A;^(m+2) ) = 
else 

Set B{k„(^^ , /c^(2) ; A;^(3) , . . . , A;^(m+2) ) = 1 
end if 
end if 
end for 

for each permutation a oi ki, . . . , km+2 of length i + 2 do 
if B{k„(i),k„(2);K{:i), ■ ■ ■ ,K{ra+2)) = 1 then 

Set A{k^(i) , k„(2) ; k^(^) ,..., k^(rn+2) ) = 1 and A{k„(2) , ka{i) ; k^(3) ,..., /co-(m+2) ) = 1 
else 

for J = 3, . . . , ?n, do 

if ^(^<7(i),^a(j); {^a(2), • • • , K(m+i)}\{Kij)}) = and 
{A{ka{i):Ki2);(p) = 1 or A(A;^(i),/i;^(j);(/>) = 1) then 

Set B{k„^i) , /i;^(2) ; A;^(3) , . . . , A;^(m+2) ) = 1 
end if 

if B{k^^2),Kuy^ {^a(i),^a(3)> • • • >^a(m+i)}\{^<7(j)}) = and 
{A{K{i),ka(2)) = 1 or vl(A:^(2),A;^(j)) = 1) then 

Set B{k„{i) , k„(2) ; ^cr(3) ; • • • ) K{7n+2) ) = 1 

end if 
end for 

Let A{k„(^i),k„(^2)',K{3)^--- ,K{ni+2)) = B{k^(^i),k^(^2)'iK{3), ■ ■ ■ ,K{m+2)) 

and A{k^(^2),K{i)',K{3),--- ,K{m+2)) = -B(A;^(i), A;^(2); ^<t{3)) • • • ,K{m+2)) 
end if 
end for 
end for 
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Algorithm 5 The Modified MMPC Algorithm, Stage 1 



For a subset S = {Xk^, . . . ,Xk^} C {Xi, . . . ,Xd}, let A(i,j; S) and A(i,j; ki, . . . , km) both denote 
the same thing. A{i, j; S*) = if Xj _L Xj\{Xk-^ , ■ ■ ■ , X^^ } is included in the set of dependencies and 
A{i,j] S*) = 1 if Aj _L AjlJAfc^, . . . , A,fc^} is rejected from the set of dependencies. 
for j = 1, . . . , d do 

Set Zj^Q = (f) (the empty set) 
for i = 1, . . . , d do 
if i = J then 
bet ^j^i = ^j^i—i 



else 

if A{i,j -,2 j^i^i] 



Set Z. 



.?:« 



z. 



■i,«-i 



then 



-'J,* 



else 



Set Zj^i = Zj^i^i U {Xi} 
end if 
end if 
end for 

end for 

Note: Sij is the set such that Aj _L Xj\Sij that determines that the edge {Xi,Xj) is not in the 

skeleton 






Figure 5: Strongly Protected Edges 
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Algorithm 6 The Modified MMPC Algorithm, Stage 2 



Recall A(i,j; S) = if Aj _L Xj\S has been accepted; A{i,j; S") = 1 if Aj / Xj\S has been accepted. 

Set Z = (p. By the end of stage 5 (algorithm |9]) of the M'^PC algorithm, this will be the set of 
immoralities. 
for j = 1, . . . ,d do 
Set yjfi = Zj 
for k = 1, . . . ,d do 
if Xj Zk then 
for i = 1, . . . ,d do 
if Aj G Zj n Zk then 
if Aj ^k,j-i then 

LetX = ZU{(A,,Ai,Afc)} 
else 

Leave X unaltered. 
end if 
end if 
end for 

Set yj^k = yj,k-i\{Xk} 
else 

Set yj^k = yj,k-i 
end if 
end for 
end for 
Set Zj = yj^d 
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Algorithm 7 The Modified MMPC Algorithm, Stage 3 



Recall A{i,j; S") = if Aj _L Xj\S has been accepted; A{i,j; S*) = 1 if Aj _L Xj\S has been rejected. 
for J = 1, . . . , d do 

Set yjfi = Zj 

for i = 1, . . . ,d do 

if Aj G yj,i-i and there exists a subset S C 3^^ j_x\{Aj} such that A(i,j; S) = then 

Set yj,^ = yj,i^i\{Xi} 

Let Sij = S. 

for p = 1, . . . ,d do 

if Xp € {yj,i\S) n Zi and Xp ^ S then 

Let X = ZU{(Aj-,Ap,Ai)} 
else 

leave I unaltered. 
end if 
end for 
else 

Set yj^i = yj,i-i 

end if 
end for 
end for 

Set Zj = yj^d 



Algorithm 8 The Modified MMPC Algorithm, Stage 4 



for j = 1, . . . ,d do 

Set yjfl = Zj. 

for k = I, . . . ,d do 

if Afc € yj,k-i but Aj ^ Zfc then 

Set 3^,- fc = yj,k-i\{Xk} 
else 

Set 3^,-,fc = yj,k-i 
end if 
end for 
Set Zj = yj^d 
end for 
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Algorithm 9 The Modified MMPC Algorithm, Stage 5 



This algorithm removes those elements from X that are not immoralities in the final graph and 
ensures that an immorality appears in X listed as {Xi,Y, Xj) for i < j. 
for i = 1, . . . ,d do 

for {j,k) € {l,...,d}'^,i^j,i^k,j^k do 
if iXj,Xi, Afc) G X and Xi Zj n Zk then 

LetX = X\{(A„A„Afc)} 
end if 
end for 
end for 

for i = I, . . . ,d do 
for 1 < j < k < d do 
if {Xk,Xi,Xj) GXthen 

Let X= (XU{(A„A,,A,,)})\{(A,,,A„A,)} 
end if 
end for 
end for 

The next stage is inserted in case the set of A{i,j; S) that have been established do not correspond 
to a distribution that has a faithful graphical model, in which case some of the immoralities may be 
incompatible. 
for I < i < k < d do 

for j = l,...,d; j ^i, j ^k do 
if {Xi,Xj,Xk) GXthen 
for p = I, . . . ,d do 

if {Xj,Xk,Xp) G X or {Xp, Xk, Xj) G X then 
if A{j,p; Sjp U {Xk}) = then 

Set X = I\{{Xj,Xk,Xp), {Xp,Xk,Xj)} 
else 

SetI = I\{{Xi,Xj,Xk)} 
end if 
end if 
end for 
end if 
end for 
end for 
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Algorithm 10 The Modified MMPC Algorithm, Stage 6 



Set Ui = Z\ 

for i = 2, . . . , d do 

SetZ^, = ZA{Ai,...,Ai_i}. 
end for 

By the end of the algorithm, the essential graph will contain an undirected edge (Aj, Aj) if and only 
if Aj G Ui for j > i and Xj Hi for j < i. 

Set V\ = . . . =Vd = ^- By the end of the graph, these will be the sets of variables such that, in the 
essential graph, there is a directed edge y i-^> Aj if and only if y G "Pj. 
for i = l,...,(i — Ido 
for /c = z + l,...,(ido 
for J = 1 , . . . , d do 

if (Ai,Aj-,Afc) GZthen 

Let Ui = Ui\{Xj}, let Uk = Uk\{X,}, let Uj = Uj\{Xi,Xk}. 
heiVj=Vj\J{Xi,Xk} 
end if 
end for 
end for 
end for 



Algorithm 11 Locating Additional Strongly Protected Edges, Stage 1 



for i = 1, . . . ,d — 1 do 
for /c = i + l,...,(ido 
for J = 1 , . . . , d do 
for / = 1, . . . , d do 

if A; € Zj n Zj n Zk then 

Let Vj = Vj U {AJ, Ui = Ui\{Xj} and Uj = ^\{AJ 
end if 
end for 
end for 
end for 
end for 
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Algorithm 12 Locating Additional Strongly Protected Edges, Stage 2 
Let L»2 = 1 
repeat 

Let Di = 
for i = 1, . . . , d do 
for j = 1, . . . ,d do 
if {Xi,Xj) eDthen 
for k = 1, . . . ,d do 
if Afc G Zj\Zi then 

Set Pfc = Pfc U {A,}, Di = Di + 1, ^/, = ^^,\{^fe}' ^fe = l^k\{X,} 
end if 
end for 
end if 
end for 
end for 
Let D2 = Di 
until D2 = 



Algorithm 13 Locating Additional Strongly Protected Edges, Stage 3 
Let D2 = 1 
repeat 

Let Di = 
for i = 1, . . . , d do 
for J = 1 , . . . , d do 
if (Ai,Aj) G dthen 
for A; = 1, . . . , d do 

if {Xj,Xk) G D and Xj G Zj then 

Let Pfc = T'fc U {AJ, Di = Di + 1, let ^/, = ^^Al^fe}, let Uk = Uk\{X,}. 
end if 
end for 
end if 
end for 
end for 
Let D2 = Di 
until D2 = 4> 
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Algorithm 14 Locating Problems Caused by non-existence of Faithful graphical model 

Recall that X is the set of immoralities obtained from data. Let Dq denote the set of edges such that 
(X, y) G Pq if ^-iid only if either (X, Y^Z') G X or (Z, 1", X) € X for some Z. Let £" denote the set of 
edges in the skeleton before this stage of the algorithm. 
repeat 
repeat 

Locate a cycle of directed edges. Select at random an edge in the cycle that does not belong to 
"Dq. Change the orientation of the edge. If this results in an immorality {X,Y,Z), then add a 
directed edge X >-^ Z or Z >-^ X, choosing a direction, if possible, that does not create a cycle. 
until there are no directed cycles. 
repeat 

Locate fC, the set of immoralities {X, Y, Z) such that (X, Y, Z) ^ I, and {{X, Y), (Y, Z)} <Z 8. 
for each immorality (X, Y,Z) G /C add either the directed edge X ^^ Z oi Z ^^ X \n such a way 
that no directed cycles are created. (This is clearly possible - if both directions cause a directed 
cycle, then there is already a directed cycle present). 
until all the immoralities in /C have been dealt with. 

Perform algorithms [11] [TJl [131 locating and orienting undirected edges that take the position 
(W, Y) in one of the three structures in figure [5] and directing them W >-^Y according to figure [5) 
until there is a full run in which no changes are made. 

Add in the necessary undirected edges to triangulate the sub-graph obtained by removing the 
directed edges, adding in the undirected edge Zi — Z2 if this involves adding in an undirected edge 
in the position taken by TV — y in structure 3 of figure [5] 
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5 Time Complexity of the Algorithms 

A common measure of the 'complexity' (or time required) for an algorithm is the number of statistical 
calls that are made (definition |6]) . This is discussed in Spirtes, Glymour and Scheines |36] and Mar- 
garitas and Thrun |23) . Both of these use the G^ statistic (definition [1] equation ([TJ), calculating the 
p value and accepting independence if the p value is less than the nominal level a, usually a = 0.05. 

If the number of variables is very large and the number of instantiations is very large, then a more 
accurate picture of the complexity would be the number of statistical calls necessary, where a statistical 
call is defined as a reference to the data set in order to compute the marginal cell counts for a subset 
of the variables. 

Recall that the M^PC algorithm locates the skeleton and immoralities through algorithms [5l [6l [3 
ISl |9]and[Tni Of these, algorithms |9] and [TOl are purely algebraic rearrangements, checking whether or 
not the edges associated with immoralities obtained previously have been deleted and if not putting the 
edges into the directed edge sets. Stage 4, given by algorithm [5] is also purely algebraic, removing Xj 
from Zf: if X^ ^ Zj. These parts of the algorithm do not have significant computational complexity; 
they do not require tests of conditional independence. Following the criteria of Spirtes, Glymour and 
Scheines, and also Margaritas and Thrun, the computation complexity of the algorithm comes from 
the conditional independence tests and these are all located in stages 1,2 and 3; algorithms Ul E] and [7] 
respectively. 

Stages, stages 7, 8 and 9 (finding the additional strongly protected edges) given by algorithms [TT| 
[12] and [13] respectively are algebraic, locating those edges whose directions are logically forced by the 
immoralities. 

If the set of conditional independence statements does not correspond to a distribution with a 
faithful graphical representation, then stage 5 (algorithm [9]) may require some additional tests; the 
number will increase depending on the level of inaccuracy of the faithfulness assumption. 

If the complexity is measured by the number of test statistics (definition [T]) , then the comparison 
is as follows. 

Stage 1 (algorithm [SJ If the test procedure outlined in |38) is used, there is an upper bound of 
d{d— 1) conditional independence tests. This may be reduced if it turns out that a test for conditional 
independence has already been made. This is the same as the first stage of the MMPC algorithm. 

Stage 2 (algorithm |6|) does not require additional statistical calls. For the current neighbour sets 
Zi^. . . Z(i, it considers situations where Xj ^ Zj. and removes the variable Xj. from Zj if X^ € Zj. If 
a variable Xj € Zj H Z^, it has to be determined whether or not the vee - structure (definition [25]) 
Xj — Xi — Xk is an immorality. This is done by determining whether or not Xj is in a set that has 
already been determined in stage 1. 

For determining upper bounds on the number of statistical calls required in the next stages, the 
following definition 
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Definition 10 (Maximal size of neiglibour sets). Let C = niaxj=i^...^(i \^j\> denote the size of the largest 
set obtained after stage 1 of the M^PC algorithm. 

Let Q denote the size of the largest parent / child set in the essential graph of a Bayesian network. 

Let P denote the size of the largest parent set for a variable among all the faithful DAGs. 

Stage 3 (algoritliiii[7} Tliis is tlie computationally most expensive part of the algorithm. As pointed 
out in section 7 of j38], the complexity for constructing the skeleton (that is, deciding which nodes 
to remove from Zj) is 0(2' j'), where \Zj\ is the number of variables in Zj. Working through the 
variables in order, a variable Xi is removed from Zj if there exists a subset S C Zj\{Xi} (where Zj 
represents the current set of variables that have not yet been eliminated) such that Xj _L Xi\S. This 
involves working through the subsets until one is found, or working through all the subsets if there is 
an edge {Xi,Xj) in the skeleton. There are 2'^' possible subsets of Zj. The article [38] therefore states 
the bound as 0{d2^), where C is the upper bound in definition [TOl If d is large, then the algorithm is 
clearly not feasible if C is of order d. 

Furthermore, there is a limit, depending on the size of the data set, on the size of the conditioning 
set for which independence tests can be made. For a test of Hq : X }-Y\S, \l \s required that for all 
{x,y), Nx,Y,s{x,y',s) > 5 (where -/Vx,y,5(a;,y; s) is defined in definition [1]) for at least one instantiation 
s of S. This is the criterion suggested and applied in |38| . It is necessary for the x^ distribution to 
have any level of accuracy, but for a fixed number of data, the larger the number of states that the 
conditioning set S can take, the lower the power of the test and the smaller the probability that Hq is 
true when the result of the test is 'insufficient information to reject Hq\ 

An upper bound of size /, for a fixed I that is determined either by the size of the data set, or else 
computational cost or both, has to be placed on the subsets taken from Zj for each j = 1, . . . ,d. In 
this case, there is an upper bound on the number of subsets to be considered given by 

< de&. 
=0^^ J 

This is in line with the order stated in |38) . 

In practise, it is often observed that even though the parents and children set of Xj is a subset of 
Zj obtained after stage 2 nevertheless, maxj=i ,,,^rf \Zj\ (the maximum size of Zj, j = 1, . . . ,d obtained 
after stage 2) is often of the same order as the size of the largest parents / children set, G (definition [10]) • 
That is, although < C it usually turns out that C = 0(G). This remark is made in j38] section 7. 

In line with [38], the complexity of the algorithm for locating the skeleton is bounded above by 

d{d - 1) + edC^ (5) 

where / is the size of the largest conditioning set permitted, d is the number of variables and C is the 
number of variables in the largest set Zj obtained after stage 2. 

The additional complexity required to obtain the immoralities is small by comparison; no additional 
statistical calls are required. 
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It follows that the number of conditional independence tests required for the MM PC algorithm 
and for the M^PC algorithm when the modification for incorporating immoralities is included, but the 
modification to the independence tests is not included, is bounded above by expression ([5|). Using this 
measure of complexity, both algorithms have equal complexity, but the modification requires additional 
algebraic stages. 

Complexity with the Modified Procedure to Establish the Conditional Independence 
Statements The modification that alters the way that conditional independence is determined, yields 
a substantial increase in the complexity. The example in subsection 17.11 illustrates . on an example with 
6 variables with 1000 observations, that running the algorithm without the modification can lead to the 
removal of edges that result in a d-separation statement that is present in the graphical model, when 
the corresponding independence statement has been rejected at the 5% significance level. The measure 
required to circumvent this measure is computationally expensive; in situations where adequate results 
are achieved without the modification, the additional accuracy may not be justified by the increase in 
complexity. 

If Xi _L Xj\{Xj.-^ , ■ ■ ■ , ^km} i^ considered, there are 

•^^^ I < (m + 2)(m + l)2"^-i 

tests to be carried out. Let C = max{|^ij, . . . , |2(j|} be the maximum size of the sets obtained after 
the first stage of the algorithm, then it follows that there is an upper bound of 

d{d-l){C + 2){C + l)2^-^ (6) 

This corresponds to an increase in the computational complexity by a factor of 

(C7 + 2)(C + 1)2^-1. (7) 

on the number of tests required to perform the first stage of the algorithm if no limit is put on the size 
of the conditioning sets. This upper bound is substantially larger than the true number of statistical 
calls required; it does not take into account that Zi f] Zj j^ (p for some {i,j) and that once the value 
of A{i,j; S) is registered, it is not computed again. 

If a limit of size / is put on the size of the conditioning sets, then for a neighbour set of size C where 

f C \ 

C > I, there are values of A{p, q; S) to be established to determine whether or not Xj. G Zj 

if the modification to the independence testing is employed. There is therefore an upper bound of 

d{d-l)( ^^^ yi + 2){l + 1)2'-' <d{d- 1)^(1 + 2)2'-' (8) 

on the number of chi squared tests required to complete stage 1 of the M^PC algorithm. After these 
tests are performed, it is clear that no further conditional independence tests are required. 
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If a limit / is put on the size of conditioning sets, then if \Zr^\ > Z, it follows that an upper bound of 

I \H \ 

tests are required in stage 1 t 
set. There is clearly an upper bound of 



ZA \ 

tests are required in stage 1 to determine that a variable should be added to the neighbour 



^(/ + 2)2'-i (9) 

on the factor by which the computational complexity is increased when the modification for determining 
conditional independence is adopted. Note that when the modification for determining conditional 
independence is applied, all the conditional independence statements required are computed in stage 
1 of the algorithm. 

Alternative Complexity Measures If the complexity is measured by the number of statistical 
calls, where a statistical call is defined as a computation of the marginal empirical probability distri- 
bution over a set of variables that requires reference to the data set, then all the statistical calls are 
made in stage 1 of the algorithm; there is an upper bound of d{d — 1). There is no difference in the 
number of statistical calls required for the MMPC and the modified MMPC. The modification to the 
method for testing conditional independence does not require any further calls to the data set; if it is 
required to determine a conditional independence relation for a set of variables Af, then the empirical 
probability distribution for any subset of X can be derived from the empirical distribution over X. 

It is expected that, as the number of variables and number of instantiations increases, the deter- 
mining factor will increasingly be the number of calls to the data set. 

6 Related Work 

The work in this article is motivated by the MMPC part of the MMHC algorithm of [38], noting 
that with a relatively small addition in the computational complexity, the essential graph can be 
recovered by the modification proposed for the M^PC algorithm that locates the immoralities, thus 
rendering the edge orientation phase (the most time consuming phase) of the Maximum Minimum Hill 
Climbing algorithm of |38) unnecessary. This modification leads to a substantial improvement in the 
time taken. The modification introduced to ensure that the conditional independence statements are 
consistent with the dependence statements derived from data improves the accuracy of the algorithm, 
as illustrated in the example in subsection 17.11 

As mentioned in section [1] the main approaches to the problem of locating the graph structure are 
constraint based (testing for conditional independence), search- and-score (maximising an appropriate 
score function) and hybrid (methods that use both constraint based ideas and search and score ideas). 
The Maximum Minimum Hill Climbing algorithm presented in |38) is a hybrid algorithm, using the 
constraint based MMPC procedure to obtain the skeleton and, once the sets of neighbours for each 
variable is established, it uses a greedy search and score algorithm, starting with an empty graph and, 
using the Bayesian Dirichlet metric (discussed later) for scoring, adds or deletes a directed edge or 
reverses the direction of an existing directed edge, to produce the directed acyclic graph that gives the 
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maximum score at each stage. The available edge set for the edge orientation phase is restricted to 
those of the skeleton located by the MMPC algorithm. A list is kept of the previous 100 structures 
explored and only those add / delete / reverse operations that do not result in a DAG on the list 
are considered. When 15 changes are made to the DAG without an increase in the maximum score 
that has been encountered up to that point in the search, the algorithm terminates and delivers the 
structure with the highest score obtained so far. 

The standard scoring functions are those obtained by assuming that, given the graph structure, the 
probability distribution may be factorised according to the graph, with a Dirichlet distribution over 
the parameter space. If a uniform prior distribution is taken over the set of possible graph structures, 
the score function becomes the likelihood function for the data given the graph structure. This is 
described in Heckerman, Geiger and Chickering |17) and is the method used in the MMHC algorithm. 

Other score functions are the Bayesian Information Criterion (BIG) (Schwartz [33])) the Akaike 
Information Griterion (AIG) (Akaike [T]), Minimum Description Length (MDL) (Rissanen |31) and |32) ) 
and the K2 scoring method (Gooper and Herskovitz [9]). The Bayesian Dirichlet scoring method has 
the following necessary property: Markov equivalent structures have the same score. 

The most basic search-and-score algorithm is the Greedy Search, which searches the space of 
Directed Acyclic Graphs (DAGs) (Ghickering, Geiger and Heckerman [7]). The size of the search 
space is super-exponential in the number of variables and there have been several modifications to 
reduce the size of the search space. The Greedy Bayesian Pattern Search algorithm (GBPS) searches 
over equivalence classes of DAGs. This is basically a search over essential graphs (definition 132 p . An 
example is the Greedy Equivalent Search (GES) algorithm (Ghickering |[5]). 

The K2 algorithm (Gooper and Herskovitz [9j) requires an ordering of the variables and uses the 
K2 metric with a greedy hill - climbing search, where the parent sets for each variable are taken from 
those variables of a lower index. With this algorithm, it is important to repeat it a large number of 
times with different orderings. The Sparse Candidate algorithm limits the number of parents that each 
variable is permitted to have to k and uses a greedy search. The Optimal Reinsertion (OR) algorithm 
by Moore and Wong [25] takes a node, removes all the edges to and from the node, and then re-inserts 
an optimal set of edges. 

Several constraint based algorithms were compared with the MMPG algorithm in [38j. These 
decide to delete an edge, thus adding a constraint to the network, based on testing for conditional 
independence. Among these were the PG algorithm (Spirtes, Glymour and Scheines |36) ) and The 
Three Phase Dependency Analysis (TBDA) by Gheng, Greiner, Kelly, Bell and Liu [4], which uses an 
information score function when testing for conditional independence. 

The GB algorithm by Singh and Valtorta |34] uses a search-and-score algorithm developed by 
Spirtes Glymour and Scheines, the SGS algorithm (which was the prototype of their PG algorithm) to 
order the nodes and then uses the K2 algorithm to orient the edges. The PG+GBPS algorithm, by 
Spirtes and Meek |37) uses the PG algorithm to establish an initial pattern, which is then used as the 
basis of a search and score procedure. The Essential Graph Search (EDS) by Dash and Druzdzel |11) 
uses the PG algorithm repeatedly, with random node orderings and thresholds. The BENEDICT 
method, which comes from BElief NEtworks Discovery using Cut - set Techniques is another method 
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that defines a search-and-score metric to measure the distance between a proposed graphical model 
and the empirical data, using d-separation to localise the problem into smaller sets of variables. 

Kovisto and Sood |20] developed an exact algorithm to identify the network that maximises the 
posterior probability density, using Bayesian Dirichlet scoring. This algorithm works well for data 
sets with less than order 100 variables, but cannot be scaled to the large data sets, containing several 
thousand variables, that are required for learning gene regulatory pathways. 

There are several other variants in the literature on the same basic ideas. Goldenberg and Moore |16) 
presented an algorithm for learning Bayesian networks from sparse data by locating sets that occur 
frequently; their approach may be used on large variable sets. 

There are also variational methods, discussed in Jordan, Ghahramani, Jaakkola and Saul |18] . 
which typically cannot be used for large networks. 

7 Evaluation 

Algorithms are usually compared against a procedure known as the gold standard. 

Definition 11 (Gold Standard). The gold standard is the procedure where all possible DAGs are 
enumerated and scored using an appropriate score function. 

In [38], several networks were used to provide a comprehensive summary of the performance of MMHC 
against the following algorithms: 



• Sparse Candidate (SC by Friedman, Nachman and Pe'er, 1999 [T 

• PC (by Spirtes, Glymour and Scheines, 2000 |36) ) 

• Three Phase Dependency Analysis (TBDA, Cheng et. al., 2002 [4j) 

• Optimal Reinsertion (OR, Moore and Wong, 2003 p5] ) 

• Greedy Equivalent Search (GES, Chickering 2002, [6]) 

In this article, the M^PC algorithm is compared with the Maximum Minimum Hill Climbing Algorithm 
(MMHC) in [38] . The main functions of this article are as follows. 

1. Determine the immoralities in addition to the skeleton using the MMPC stage of the algorithm, 
thus rendering the edge orientation phase of the MMHC algorithm (which is the computationally 
expensive phase) unnecessary. It is clear from the listing of the algorithm that this modification 
can be incorporated into the MMPC algorithm of [38] with little computational overhead, thus 
leading to a major reduction in the computational time necessary. 

2. Point out the difficulties with the method for determining conditional independence that became 
immediately apparent in a simple example on six variables, suggest a modification and justify 
the modification theoretically. 
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3. In addition, algorithm [T4l is inserted to deal with some cases when the conditional independence 
statements do not correspond to a probability distribution that has a faithful Bayesian network. 
The example in subsection 17. H where the assumption of faithfulness does not hold, indicates that 
an algorithm should be able to deal with this. 

The main purpose of the algorithm is to find a suitable graphical model for the dependency structure 
among large numbers of variables. The main application in view is genetic data for locating gene 
regulatory pathways. 

7.1 Women and Mathematics 

The algorithm was applied to the 'Women and Mathematics' data set. This data set is instructive, for 
several reasons: 

• It motivates the modification for determining conditional independence in the M^PC algorithm, 
the procedure for determining whether a statement X }- Y\S should or should not be added 
to the set of conditional independence statements used to construct the graph. Without the 
modification, there is a missing edge, the omission of which gives a misleading impression of the 
dependence structure between the variables. 

• It motivates the last stage (algorithm [H|) of the algorithm, since there is not a distribution with 
a faithful graphical representation corresponding to the logically consistent set of dependence 
relations derived using this method for determining the dependence relations. 

• Even with the modification, there are d-separation statements in the final graph that do not 
correspond to independence statements. In figure [71 E _L F||g{C, D}, but G{E,F\C,D) = 9.83 
(where G is defined by equation ([!])) while X4,0.05 = 9.45, so that conditional independence is 
rejected at the 5% significance level. This shows the limitations of the algorithm, even with 
the modification, and the problem arises with the assumption of faithfulness in determining 
edge selection. Broadly speaking, a situation where A and C are independent, B and C are 
independent, but where A and B taken together tell us everything about C, will not be detected 
by an algorithm that selects edges based on an assumption that there is a faithful distribution. 

The variables are A- lecture attendance (whether the person attended a special lecture by a female 
mathematician designed to encourage women to take up mathematics) B - gender C- school type (urban 
/ suburban), D - 'I'll need mathematics in my future work' (agree = y / disagree = n), E- subject 
preference (mathematical sciences / liberal arts), F - future plans (higher education / immediate job). 
The data is given in table 17.11 

The MMPC from |38] (that is, without any modifications) produced the skeleton corresponding 
to the graph in figure (6] This in line with other algorithms; the (MC)^ and A{MC)^ Markov chain 
Monte Carlo algorithms by |22) produce the essential graph of figure [6l 

But this graph does not represent the dependence relations in the data; Hq : C -L E is rejected 
when a significance level of 5% is employed, which is the nominal significance level used by the MMPC 
algorithm, while C is d-separated from E in figure [G] 
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The p value is slightly greater than 1%; the hypothesis test would not be rejected at the 1% 
significance level. The fact that there is no edge between C and E in the graph returned by the 
A{MC)^ algorithm is due to the penalty for additional edges in the score function. 

The MMPC algorithm proceeds as follows: with the ordering A, B, C, D, E, F on the variables, the 
stages are as follows: 

Stage 1 The candidate neighbour sets after stage 1 are 
Za = <P, Zb = {D,E}, Zc = {E,F}, Zd = {B,E,F}, Ze = {B,C,D}, Zf = {B,C,D}. 

Stage 2 This involves the removal of nodes Xj from Z^ if X^ Zj. This only results in one change; 
B is removed from Zp, giving 

Za = cI), Zb = {D,E}, Zc = {E,F}, Zd = {B,E,F}, Ze = {B,C,D}, Zf = {C,D}. 

Stage 3 In this stage, node Xk is removed from the set Zj if there is a subset S of the current set 
Zj\{Xk} such that Xj ± Xk\S. There is only one alteration at this stage; node C is removed from 
Ze- It turns out that C JL E and C /. E\D, but that C ± E\{B,D}. After this stage, the sets of 
neighbours are 

Za = ^, Zb = {D,E}, Zc = {E,F}, Zd = {B,E,F}, Ze = {B,D}, Zf = {C,D}. 

Stage 4 This stage makes a final removal of nodes Xk from Zj if Xj Z^. Only one alteration is 
made; E is removed from Zq- The final neighbour sets are 

ZA = (t>, Zb = {D,E}, Zc = {F}, Zd = {B,E,F}, Ze = {B,D}, Zf = {C,D}. 

Finally, testing vee - structures (definition |25]) to see if they are immoralities (definition |23]) gives: 
E IF andE 1. F\D so that {E, D, F) is not an immorality. D _L C, but D / C\F so that (D, F, C) 
is an immorality. The essential graph corresponding to the dependence / independence structure 
returned by the MMPC algorithm, with the vee structures tested to determine whether or not they 
are immoralities, is therefore given in figure |6l 

This graph does not represent the conditional dependencies that have been established. At nominal 
significance level of 5%, the significance level being used, the test rejected C _L i?, but in the graph in 
figure [6l C is d-separated from E. For the pair of variables C, E, the cell counts for ncE are 

ncE{l, 1) = 194, nc£(l, 0) = 149, ncE(0, 1) = 439, ncE(OO) = 305 

so that 

^2 o /^on., 294x1187 ,,^, 149x1187 

Gip = 2 X 294 log h 149 log 

^^ V 443 X 733 ^ 443 x 454 

439 X 1187 , 305 x 1187\ 

+439 X log — — - + 305 X log -— — - = 6.42. 

774 X 733 454 x 744 / 
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Table 1: Data for 'Women in Mathematics' Source: Lacampagne (1979 J2T 




Figure 6: Women and Mathematics - Essential graph obtained using MMPC algorithm, plus testing 
of vee - structures to locate immoralities 
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Note that 



Xi,o.05 = 3.84, Xi,o.oi = 6.63 



so the hypothesis that C -L E is rejected at the 5% significance level; the test has a p value close to 
0.01. Yet, for the network in figure [6l C _L £'||g</> (C is d-separated from E when none of the variables 
are instantiated). 

Similarly, in the graph, C _L E\\g{D}. But the hypothesis that C _L E\D is rejected at the 5% 
significance level, so that C JL E\D is accepted. 

ncDE(l,l,l) = 208, ncD£(l,l,0) = 71, ncD£(l,0, 1) = 86, ncD£(l,0,0) = 78 

ncfl£;(0,l,l) =318, ncDE(0,l,0) = 138, ncDE(0,0, 1) = 124, ncDE{0, 0, 0) = 167. 

nz)(l) = 735, ni5(0) = 455 

ncD(l,l) = 279, ncD(l,0) = 164, ncD(0, 1) = 456, ncD(0,0) = 291 

nDE{l,l)=526, nDE{l,0) = 209, nDE{0,l) = 210, nDE{0,0)=245 
so that 

G{C, E\D) = 2xJ2E ^CDEic d, e) log ^cnE{cd,e)nn{d) ^ ^^^^ ^ ^^^ ^ ^, 
^^ ncD{c,d)nDE[d,e) 

The hypothesis that C 1. E\D is rejected, although not by much; the p value is only slightly less than 
5%. 

The problem here is not the lack of a faithful graphical model. Rather, it is that the conditional 
independence statements that have been accepted are inconsistent with the conditional independence 
statements that have been rejected. By theorem [21 

C±E\{B,D} and {B,D} ±C ^C ±E 

C _L E\{B, D} and B ± C\D ^ C _L E\D. 

Now, the individual hypothesis tests Hq : i? _L C, Hq : D _L C and Hq : B J- C\D each result in 
Hq not rejected. But C JL E is accepted and C JL E\D is also accepted. Therefore, the statement 
C JL E\{B, D} should also be accepted, otherwise inconsistencies appear in the independence structure, 
leading to a graph that indicates d-separation where conditional dependence has been established. 
Any undergraduate engineering student will fail their introductory course in statistics if they have not 
learned that 'do not reject the null hypothesis' does not mean 'accept the null hypothesis'; this example 
provides an illustration of the importance of this simple basic principle. 

When the modification in the way that conditional independence is established is used, the final sets 
of neighbours are 

Za = cP, Zb = {D,E}, Zc = {E,F}, Zd = {B,E,F}, Ze = {B,C,D}, Zf = {C,D}. 
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Figure 7: Women and Mathematics - Essential graph obtained using M'^PC algorithm 

When the modification to locate the immoralities is introduced, there are five vee structures to be 
checked: B-E-C,C-E-D, E-D-F,C-F-D,E-C-F. Since B ±C, the algorithm 
declares (B, E, C) is an immorality. Since C _L D, the proposed algorithm declares that C — E — D 
is an immorality. Since E _L F\{B, C, D} (from stage 1 of the algorithm), it follows that E — D — F 
is not an immorality, since D is in the conditioning set. Since C _L D, it follows that C — F — D is 
an immorality. Since E ± F\{B,C,D}, it follows that E — C — F is not an immorality. Once the 
immoralities have been located, the only remaining undirected edge is B — D. Since it does not belong 
to any of the structures in figure |5l it remains undirected. The essential graph obtained is given in 
figure [T] 

A Note on Faithfulness It is important to note that the set of conditional independence relations 
constructed by the M^PC algorithm, with the modification to ensure that the conditional independence 
statements accepted are consistent with those rejected, do not correspond to a distribution that has a 
faithful graphical model. In the essential graph in figure [71 B / C\\gE. That is, B is not d-separated 
from C when E is instantiated. But 

G(B,C;i?) = 0.617 < 5.99 = X2;o.05 

so that the data does not provide evidence for B ^ C\E. 

There is a very significant conditional dependence between C and D given E] 

G(C,D;E) = 56.57 > 5.99 = X2;o.05 
which is reflected in the immorality C — E — D in the graph, while C ^ D. 

Another note on faithfulness As mentioned earlier, 

G{E,F\C,D) = 9.83 > 9.45 = xl;0.95 

so that E JL F\{C, D} is accepted at the 5% significance level. Nevertheless, in figure[71 E _L F\\g{C, D}; 
any trail between E and F has to pass through an instantiated fork connection. 

7.2 Comparisons with the MMHC Algorithm 

An extensive comparison of the MMHC algorithm against other structure learning algorithms is made 
in |38j and the reader is referred to [38] for these comparisons. With reference to these results, it is of 
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interest to understand where the modifications proposed for the M^PC algorithm would influence the 
results. 

As noted earlier, the MMHC algorithm is carried out in two parts; firstly the MMPC algorithm to 
locate the skeleton and then the edge orientation phase to produce a directed acyclic graph. The edge 
orientation phase uses the edges in the skeleton and at each step the algorithm modifies a single edge. 
The algorithm starts with the empty graph. From the entire set of possible edges, which are those 
located by the MMPC algorithm, it chooses the add edge / delete edge / reverse direction operation 
that produces a network that has not been considered previously, with the greatest increase in the BDeu 
score function. This continues until there have been 15 steps without any increase in the maximum 
score function encountered. The algorithm returns the network corresponding to the maximum score 
function encountered. Each operation will change the parent set of at most two variables. Add edge 
and delete edge changes the parent set of a single variable, reverse edge changes the parent sets of two 
variables. The updated score function when a single edge is obtained by updating the previous score 
function and requires at most one statistical call to the data set. 

Suppose the skeleton contains S edges. If each edge, with an orientation that corresponds to a 
directed acyclic graph with the correct essential graph were located at each step, so that no delete 
edge or reverse edge operations were necessary, the total number of statistical calls necessary would 
be bounded above by 25*^, assuming that the algorithm stops as soon as all the edges are added. The 
number of steps required by the MMHC algorithm is usually substantially larger than 25^. 

For each edge not present, it may be added in one of two ways; for each edge already present, it 
may either be deleted or re-oriented. The algorithm works by checking all the 'legal' possibilities, i.e. 
those that result in a directed graph without cycles. 

An easy lower bound on the number of calls, assuming that one call is required for each iteration, 
is given by S'^. 

The following 'back of an envelope' calculation gives a ball park figure for the number of calls in 
terms of the size of the parent sets. Let IT denote the average parent set size, then the number of edges 
is given by S = Hd; if Q is the average size of the neighbour sets for each neighbour, then there is a 
lower bound of ■^Q'^d? on the number of statistical calls. 

It is pointed out in |38| that for the test networks used in that article, it is usual that ~ 0{C), 
where C is the average size of the neighbour sets after the first stage of the M^PC algorithm. The 
lower bound for the edge orientation phase is of order jC'^d'^, while an upper bound for the number 
of calls for the MMPC algorithm, with a modification to locate the immoralities, is d{d — 1) + eC d 
(given by equation ([5])) and an upper bound for the M^PC (both modifying the method to determine 
independence and also locating the immoralities) is given by d{d—\)^{l + 2)2^~^ (equation ([8])), where 
C is the size of the largest neighbour set and / is the size of the largest conditioning set used. 

If the complexity is measured by the number of statistical calls, where a statistical call means a 
operation that involves using the entire data matrix, then the upper bound for the complexity of the 
M^PC algorithm is d{d — 1) calls, and this is the same value, whether or not the modifications are 
employed. 
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Table 2: Sizes of parent sets for the ALARM network 



Evaluation results from 

found on 



The article [38] evaluates the MMPC algorithm using test networks 



http : //www . cs . huj i . ac . il/labs/compbio/Repository/networks . html 



These are often used in evaluation, to provide an objective standard for assessing the performance of 
a structure learning algorithm. In |38], section 9.2.7, several examples are given of the performance of 
the Minimum Maximum Hill Climbing algorithm when applied to the 'ALARM' network, one of the 
networks on the web page above, and various networks produced by tiling the ALARM network. That 
is, to get a larger network, the ALARM network is repeated many times, with suitable edges added to 
connect the various tiles. Some results are given for: 

• a single copy of ALARM (37 variables and 46 edges) 

• ALARM 10 - 10 tiles of ALARM 

• 135 tiles of ALARM (4995 variables and 6845 edges) 

• 270 tiles of ALARM (9990 variables and 13640 edges) 

The results in [38j for the third of these, with 4995 variables are instructive, because it is the only 
place in that article where separate indications are given of the time taken for the MMPC stage of the 
algorithm and the edge orientation stage of the algorithm. 

Only the MMPC algorithm is applied to the network with 9990 variables. 

The number of missing edges and extra edges are given for the single copy and the tiling with 10 
copies. 

For a single copy of the ALARM network, the sizes of the parent sets and neighbour sets, and the 
frequency with which these size occurs, are given in table [2] 

For the example with 4990 variables, the MMPC stage took 19 hours, while the edge orientation 
stage took approximately 300 hours. That is, the edge orientation stage of the algorithm took 15 times 
as long as the skeleton reconstruction phase. The complexity of the M^PC algorithm is similar to the 
complexity of the MMPC algorithm of |38| . so the computational advantage is clear. 

For the network with 9990 variables and 13640 edges, only 1000 training examples were used. The 
time taken was reported in |38) as 62 hours. Compared with the amount of time taken by the network 
on 4995 variables, the ratio of the time taken for the ratio of small: large network is 1 : 3.26, which 
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is considerably less than 1 : 4, the ratio expected if the time taken is proportional to d{d — 1). This 
may well be because the four parallel processors were working faster than four times the speed of a 
single processor. The network with 5000 variables was reconstructed using a single processor, while the 
network with 10000 variables was reconstructed using four parallel processors. There may have been 
proportionately fewer calls, because fewer training examples were used, so that fewer test statistics 
could be computed with accuracy. 

Accuracy results from |38j For the various networks, the accuracy of reconstruction was as follows: 

1. The ALARMIO network (370 variables, 500 edges) Based on repeated trials, each with 5000 
training examples, the reconstruction in |38) had an average of 27.2 extra edges and 109.0 missing 
edges. 

2. The network constructed from 135 tiles of the ALARM network (4995 variables and 6845 edges) 
Based on a single run, using 5000 training examples, the reconstruction had 1340 extra edges 
and 1076 missing edges. 

3. The network constructed from 270 tiles of the ALARM network (9990 variables and 13640 edges) 
Based on a single run, using 1000 training examples, the reconstruction had 11068 extra edges 
and 2572 missing edges. 

The networks were used to generate random data according to the distribution. The algorithm was 
then applied to the randomly generated data to reconstruct the network. 

Extra Edges Although the reconstructed networks for 370, 4995 and 9990 variables all contain edges 
additional to the underlying graph structure of the probability distribution used to generate the data, 
this reflects the data rather than the algorithm. The algorithm rejects an edge if the direct association 
is insignificant at the 5% level, or if any of the conditional associations computed are insignificant at 
the 5% level. Therefore, edges appearing represent associations genuinely present in the data, even if 
the associations were not present in the underlying distribution from which the data was generated. 

Consider d independent variables. Suppose that, in the data generated, two variables are associated 
with probability p. Let X denote the number of associations in the data, then X ~ Bi[-^d{d — l),p) 
which, for ^ P ^ 10 and p small may be approximated as N{^d{d — 1), ^d{d — 1)). If d = 10000 
and p = 2 X 10~^, then 

p{X > 10000) ~ 0.5 and p{X > 9700) ~ 0.99. 

The numbers of 'extra edges' quoted by [38] in their results would be consistent with the number of 
genuine associations that may be expected in data if the probability of an association appearing in the 
data between two independent variables is of order 10"'^. The assumption of faithfulness ensures the 
rejection of edges unless a clear association is shown; the method of accepting conditional independence 
statements increase the probability of rejecting an edge even when there is a clear association in the 
data. 
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Missing Edges The results for the ALARM networks with 500, 6845 and 13640 edges in |38] show 
109 (averaged over several trials), 1340 and 2572 missing edges respectively. This is a substantial 
number, representing 22%, 20% and 19% of the edges respectively in the underlying distributions from 
which the data is obtained. This suggests that the edges are missing for the same reason that the 
edge between C and E is rejected in the example in subsection 17.11 conditional independence tests 
with larger numbers of variables fail to reject a hypothesis and thus imply independence for a smaller 
conditioning set where independence has already been rejected. 

8 Conclusion 

The Maximum Minimum Hill Climbing algorithm, introduced by Tsamardinos, Brown and Aliferis is 
a powerful algorithm for locating the skeleton of a faithful graph, if there exists a faithful Bayesian 
network, for large networks, where the dependency structure is sparse. This article takes the first stage 
of the MMHC algorithm, the stage where the skeleton is determined, and locates the immoralities 
without requiring additional statistical calls. This enables the essential graph to be constructed, 
which captures all the Markov properties that are contained in any faithful directed acyclic graph. 
Its implementation is feasible whenever the implementation of the MMPC algorithm of Tsamardinos, 
Brown and Alifeis is feasible and it leads to a substantial decrease in the complexity compared with 
the MMHC algorithm. 

The method for determining conditional independence statements used in the MMPC algorithm 
may lead to inconsistencies; an independence statement that is not rejected is accepted, and may 
contradict earlier dependence statements that have been established. This difficulty appears in the 
WAM data, discussed in subsection 17.11 A method for producing a set of independence statements 
that are consistent with each other is therefore proposed. 

The conditional dependence / independence relations established may not have a faithful graphical 
representation. This is motivated by the WAM data set of subsection 17. 1| a randomly chosen data 
set of 1000 observations on 6 binary variables, which exhibits conditional dependence / independence 
relations that do not have a faithful graphical representation. An economical method for locating an 
essential graph with minor modification to the skeleton is suggested that returns an essential graph 
containing all the dependencies derived for the construction of the skeleton, although this is not ex- 
haustive; if the faithfulness assumption does not hold, the procedure may miss some dependencies that 
were not established in the skeleton construction phase. 

The value of the first of these modifications is clearly gives a substantial reduction of the time 
required. The second leads to greater accuracy, dealing with a situation motivated by the WAM 
example (subsection 17. ip . The third of these, inserting algorithm 1141 is necessary if the faithfulness 
assumption is not guaranteed. 

The WAM was a randomly chosen data set, chosen simply for its convenience as a six variable set 
that had already been used to test other algorithms and not because it was imagined that it might 
illustrate lack of faithfulness. This indicates that lack of faithfulness is not rare and that situations 
requiring the modification in the procedure for determining conditional independence are not rare. 
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Unfortunately, while programming the algorithm proposed in this article is straightforward enough, 
and while initial results look promising, the author did not have sufficient computational power available 
to engage in larger experiments. The additional time taken when the modification for determining 
conditional independence is included is therefore unknown for larger networks. The purpose of the 
article was to a) point out the theoretical point that the computations made in the MMPC algorithm, if 
used properly, already give the immoralities and hence Markov structure of the network, thus rendering 
the edge orientation phase of the MMHC algorithm unnecessary, b) point out the theoretical problem 
that could occur with the method for determining conditional independence, show that it does happen 
in a straightforward real data set and propose a solution that represents all associations established 
from data and c) point out, using a real data set, that lack of faithfulness occurs in practise and 
therefore has to be considered in a structure learning algorithm. 

9 Appendix 

The appendix describes the background material so that the article is self contained. Let V = 
{Xi, . . . ,Xci} denote a collection of random variables, each with a multinomial distribution, which 
are related, with joint probability distribution p. In the article, Q = {V, E) is used to denote a graph 
with variable set V and edge set E. In general, E = D UU where D is the set of directed edges and 
U is the set of undirected edges. If the graph is directed, the notation Q = {V, D) will be used and if 
undirected, the notation Q = {V^U) will be used. The notation {Xi,Xj) is used to denote a directed 
edge between two variables Xi and Xj . The notation {Xi , Xj ) is used to denote an undirected edge 
between two variables. In diagrams, the notation 

®^@ 

denotes (Xi ,Xj), a directed edge from Xi to Xj and 



[X,) (X, 

denotes {Xi ,Xj), an undirected edge between Xi and Xj . 

For sets of random variables A, B and C, A ^ B denotes that A is independent of B and A J- B\C 
denotes that A is conditionally independent of B given C. A JL B denotes that A is not independent 
of B and A JL B\C denotes that A is not conditionally independent of B given C. 

Definition 12 (Directed Acyclic Graph). A graph Q = {V^D) where V is the set of nodes and D the 
set of edges is said to be a directed acyclic graph if each edge in D is directed and for any node X ^V 
there does not exist a distinct set of nodes Yi, . . . ,Yn for n > 1 such that X ^ Yi for all i = 1, . . . ,n, 
the edge (X, Yi) G D, (y„, X) e D and {Yi,Yi+i) e D for i = 1, . . . ,n - 1. 

There are several possible terminologies for the various connections in the following definition; this is 
one of the standard ones. 
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Definition 13 (Fork connection, Chain connection, Collider connection). Two variables Xi and X^ 
are connected by a fork connection, via a third variable X2 if the structure is 



iXi)< 1X2) yiXs 

Xi and X3 are connected by a chain connection via X2 if the structure is 

(g) y(x^ y(x^ 

Xi and X3 are connected by a collider connection via X2 if the structure is 

(x^ — >(x^< — (xj) 



The nodes at the centre of a fork, chain or collider connection may be referred to as fork, chain or 
collider nodes. 

Definition 14 (Trail). A trail between two nodes X and Y is a collection of nodes Zi, . . . ,Zn such 
that, setting X = Zq and Y = Zn+i, for i = 0, . . . ,n, (Zi, ^i+i) or (^i+i, Zi) or {Zi, ^j+i) are in E. 

Definition 15 (Directed Path). A directed path between two nodes X and Y is a collection of nodes 
Z\, . . . ,Zn such that, setting X = Zq and Y = Zn+i, for i = 0, . . . ,n, (Zi, ^i+i) G D. 

Definition 16 (Descendant, Ancestor, Parent, Child). In a directed acyclic graph, X is an ancestor 
ofY if there is a directed path from X to Y . The node Y is a descendant of X if there is a directed 
path from X to Y . The node X is a parent of Y if there is a directed edge {X, Y) & D. The node Y is 
a child of X if there is a directed edge (X, Y) ^ D. 

Definition 17 (S'-active trail, blocked trail). Let Q = {V,D) be a directed acyclic graph. Let S dV. 
A trail between two variables X and Y not in S is said to be S-active if 

1. Every collider node Z is in S or has a descendant in S. 

2. Every other node is outside S. 

A trail between X and Y that is not S-active is said to be blocked by S. 

Definition 18 (Instantiated). Let V = {Xi, . . . ,X(i} denote a set of random variables. A random 
variable Xj is said to be instantiated when the state of the variable is known. 

Definition 19 (d-separation) . Let Q = {V,D) be a directed acyclic graph. Let A, B and S be three 
disjoint subsets of V . Two nodes Xi and Xj are said to be d-separated given S if all trails between Xi 
and Xj are blocked by S. This is written Xi _L XjUgS*. Two sets A and B are said to be d-separated 
by S if for any Xi & A and Xj G B, Xi 1. Xj\\gS. This is written A _L SUgS". 

Definition 20 (Bayesian Network, Factorisation). A Bayesian network is a pair {Q,p), where Q = 
iy, D) is a directed acyclic graph, V is the node set and D is the set of directed edges such that 
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• For each node X^ G V with no parent variables, there is assigned a probability distribution px^ 
and to each node X^ with non - empty parent set n„ = {X (v) , • • • , X (v) ) there is a conditional 
probability distribution px^m^ such that (declaring px^m„ = Px„ if^v = (p the empty set) 

d 

PXr,...,x, = JJpx„in„- 

v=l 

Such a decomposition is known as a factorisation. 

• The factorisation is minimal in the sense that Uj is the smallest subset of {Xi, . . . , Xj^i} such 
that Xj ±{Xi,..., Xj_i}\Uj\Uj. 

The definition of d-separation is motivated by the following consideration. Consider three variables 
Xi, X2, X3 with joint probability function PXi,X2,X3- Suppose that this distribution may be factorised 
PX2PXi\X2PX3\Xn according to the fork node. Then Xi / X3, but Xi _L X3IX2. 

li PXi,X2,X3 = PXiPx2\XiPx-AX2^ then the factorisation is according to the chain connection. Again, 

Xi/X3,butXi±X3|X2. 

If PXi,X2,X3 = PXxPXiPx2\Xx,X3-, then the Bayesian network is the collider connection. Here Xi _L 

X3,butXi/X3|X2. 

These conditional independence statements may be considered as d-separation statements in the 
corresponding directed acyclic graphs, with S = {-^2}- 

If (i-separation holds, then conditional independence also holds. 

Theorem 21. Let Q = {V, D) be a directed acyclic graph where V = {Xi, . . . , X^} is a set of random 
variables. Let p be a probability distribution that factorizes along Q. Then, for any disjoint subsets 
A,B and S of V , it holds that A ± B\S if A 1. B\\gS. That is, if the d-separation statement holds, 
then A is conditionally independent of B given S. 

Proof This is theorem 2.2 found on page 66 of |19) . where a direct proof of the result is given. D 

The converse does not necessarily hold. When it does, a directed acyclic graph G = {V, D) is said 
to be faithful to the distribution. 

Definition 22 (Faithful). A directed acyclic graph Q = (V^D) over a variable set V = {Xi, . . . ,X(i} 
is said to be faithful to a probability distribution p over V if for any three disjoint subsets A, B and S 
of V , it holds that 

A^B\\gS^ A^B\S. 

Definition 23 (Immorality). Let Q = (F, E) denote a graph, where E = DUU, D is the set of directed 
edges and U the set of undirected edges. An immorality is a triple {X, Y, Z) such that {X, Y) G D, 
{Z, Y) G D, but {X, Z) D, {Z, X) ^ D and {X, Z) U. 

Definition 24 (Skeleton). The skeleton of a graph Q = {V., E) where E = DUU is the graph obtained 
by making the graph undirected. That is, the skeleton of Q is the graph Q = (F, E) where {X, Y) ^ E 
if and only if either {X, Y) £ U or (X, Y) £ D or {Y, X) e D. 
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Definition 25 (vee - structure). A vee structure is defined as a triple of variables {X,Y,Z} such that 
the skeleton contains edges {X,Y) and {Y,Z), but no edge {X,Z). 

Definition 26 (Markov Blanket). In a directed acyclic graph, the Markov blanket Ai{X) of a node X 
is the set containing the parents of X, the children of X and the parents of children of X. 

Lemma 27. For any variable Y € y\{a} VJ M.{X), X is d-separated from Y by M.{X). 

Proof of lemma 1271 Exercise 6 on page 77 of jT9] D 



The following theorem is of crucial importance and comes from Spirtes, Glymour and Scheines |35] 

Theorem 28. The skeleton of a faithful graph contains an edge {X, Y) between two variables X and 
Y if and only if X JL Y\S for any set S C y\{X, y} (including the empty set). 



Proof of theorem 1281 If there exists a set S such that X J-Y\S for some set S, then all trails between 
X and Y are blocked by S and hence there is no edge {X, Y) in the skeleton of the faithful graph. 

If there is no set S such that X _L Y\S, then from lemma [271 it follows that Y € A4{X) (the 
Markov blanket of X). If there is no edge between X and Y, it follows that Y is the parent of a 
child of X and is not a parent of X. Using n(X) to denote the parents of X, it therefore follows that 
Y _L X|n(X), contradicting the assumption that there is no set S such that X _L Y\S. 

It follows that if there is no set S such that X _L Y\S, then Y is either a parent or a child of X. In 
both cases, the edge {X, Y) is present in the skeleton of a faithful graph. D 

Theorem 29. Let Q be a DA G that is faithful to a probability distribution p. A vee structure X — Y — Z 
(definition \25\) is an immorality (definition W3(l if and only if there exists a subset S C V\{X., Y, Z} 
such that X _L Z\S. 

If X — Y — Z is an immorality, then for any subset S such that X J- Y\S, it follows that X JL 
Z\SU{Y}. 

Proof If there is no edge between X and Z, it follows that X J- Z\S for some 5 C F\{X, Z}. This 
is seen as follows. If Z is not in the Markov blanket of X, then let S = A4{X) (definition [26]) . 
li Z ^ A4{X) but there is no edge between X and Z, then Z is the parent of a child of X. Let 
S = n(X) U n(Z) (the parents of X and the parents of Z), then X _L 2'||gS' (that is, X and Z are 
(i-separated given S), because any open trail from X to Z cannot take its first step through n(X) 
(instantiated chain or fork) and therefore takes its first step through a child of X and (by symmetry) 
its last step through a child of Z. Therefore, all connections until the first instantiated node are chains 
(otherwise there is an uninstantiated collider) and hence the first instantiated node encountered is not 
in n(X) (otherwise the DAG contains a cycle). Symmetrically, the last collider encountered is not in 
n(Z). It follows that either the trail does not encounter any nodes in n(X) U n(Z), but this is not 
possible, because the sequence of chain connections forces the last node on the trail to be a parent of 
n(Z) which is a contradiction. Otherwise the last collider node encountered is a parent of X, forcing 
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X to be a descendant of Z and the first collider node (to prevent a cycle in the DAG) is a parent of 
Z, forcing Z to be a descendant of Z, which is a contradiction. 

If X — y — Z is an immorality, then Y n(X) U n(Z), for if it is in either of these sets, then it is 
both a parent and a child of at least one of the variables, which is a contradiction. If S" = n(X) un(Z), 
then X ± ZllgS, so there is a set S satisfying the property. It is also clear that X / Z^qS U {Y^. 

Now suppose that there exists a set S € V\\X^Y^Z^ such that X _L Z\S. Since y S, the trail 
X — Y — Z with Y uninstantiated is blocked and hence X — Y — Z \& a, collider connection. 

If X — y — Z is an immorality, then it is clear that for any subset S G V\\X^ y, Z} such that 
X ^ Z\S^ X ^ Z|S'U {y}, because when Y is instantiated, the trail X — y — Z is open when X — Y — Z 
is a collider. D 

Definition 30 (Markov Equivalence). Two directed acyclic graphs Qi = {V^Di) and Q2 = (y,D2) are 
said to be Markov equivalent if and only if for any three disjoint subsets A, B, S of V , 

A ± B\\g,S ^ A ± B\\g^S. 

That is, d-separation statements in one graph are the same as d-separation statements in the other. 

Theorem 31. Two directed acyclic graphs are Markov equivalent if and only if they have the same 
skeleton and the same immoralities. 

Proof The proof of this is found in P.Verma and J. Pearl, corollary 3.2 in [44]. D 

Definition 32 (Essential Graph). Let G be a directed acyclic graph. The essential graph Q* associated 
with Q is the graph with the same skeleton as Q, but where an edge is directed in Q* if and only if it 
occurs with the same orientation in every directed acyclic graph that is Markov equivalent to Q. 

Definition 33 (Strongly Protected Edge). An edge is said to be strongly protected if occurs either in 
an immorality, or else in one of the three structures found in figure O 

Lemma 34. Once the immoralities have been determined, the direction of the edges between X and Y 
in the structures of figure are determined. 

Proof of lemma [34] In the first structure, it is clear that if the edge Z 1— )• X is directed from Z to X, 
then the direction X to y is forced, because otherwise there would be an additional immorality. 

In the second structure, it is clear that once the directions X 1— )• Z and Z 1— > y are forced, then the 
direction X 1— )■ y is forced, because otherwise there is a cycle. 

In the third structure, if (Zi, y, Z2) is an immorality, then the direction X 1— > y is forced, because 
if the edge is in the other direction, then either a new immorality (Zi, X, Z2) is created, contradicting 
the conditional independence statements that have already been established, or else there is necessarily 
a cycle. 

It is also clear that the structures in the definition [33] are the only ones where the directions of 
edges are forced. The justification is as follows: 

Consider two neighbour nodes X and Y which are neighbours. Firstly, suppose that X and Y do 
not have any other common neighbours. If X does not have a neighbour Z such that there is a directed 
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edge {Z,X), then the direction (X, 1") is not forced; no additional immoraUty or cycle is created by 
either direction. 

Now suppose that X and Y have at least one common neighbour. Suppose that there are no 
neighbours Z such that both {X,Z), {Z,Y) in the directed edge set, then it is not necessary to force 
the direction (X, Y) to prevent a cycle. 

Suppose, furthermore, that there is no pair of common neighbours W and Z, such that (Z, Y) 
and (W, Y) are both in the directed edge set and (Z, X, W) is not an immorality. Then it is not 
necessary to force the direction {X, Y) to prevent either (Z, X, W) becoming an immorality or else a 
cycle appearing. D 
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